Streptococcus pneumoniae Polynucleotides and Sequences

FIELD OF THE INVENTION 

The present invention relates to the field of molecular biology. In
particular, it relates to, among other things, nucleotide sequences of Streptococcus
pneumoniae, contigs, ORFs, fragments, probes, primers and related
polynucleotides thereof, peptides and polypeptides encoded by the sequences, and
uses of the polynucleotides and sequences thereof, such as in fermentation,
polypeptide production, assays and pharmaceutical development, among others.

BACKGROUND OF THE INVENTION 

Streptococcus pneumoniae has been one of the most extensively studied
microorganisms since its first isolation in 1881. It was the object of many
investigations that led to important scientific discoveries. In 1928, Griffith
observed that when heat-killed encapsulated pneumococci and live strains
constitutively lacking any capsule were concomitantly injected into mice, the
nonencapsulated could be converted into encapsulated pneumococci with the same
capsular type as the heat-killed strain. Years later, the nature of this "transforming
principle," or carrier of genetic information, was shown to be DNA. (Avery, O.T.,
et al., J. Exp. Med., 79:137-157 (1944)).

In spite of the vast number of publications on S. pneumoniae, many
questions about its virulence are still unanswered, and this pathogen remains a
major causative agent of serious human disease, especially community-acquired
pneumonia. (Johnston, R.B., et al., Rev. Infect. Dis. 13(Suppl. 6):S509-517
(1991)). In addition, in developing countries, the pneumococcus is responsible for
the death of a large number of children under the age of 5 years from pneumococcal
pneumonia. The incidence of pneumococcal disease is highest in infants under 2
years of age and in people over 60 years of age. Pneumococci are the second most
frequent cause (after Haemophilus influenzae type b) of bacterial meningitis and
otitis media in children. With the recent introduction of conjugate vaccines for H.
influenzae type b, pneumococcal meningitis is likely to become increasingly
prominent. S. pneumoniae is the most important etiologic agent of
community-acquired pneumonia in adults and is the second most common cause of
bacterial meningitis behind Neisseria meningitidis.

The antibiotic generally prescribed to treat S. pneumoniae is
benzylpenicillin, although resistance to this and to other antibiotics is found
occasionally. Pneumococcal resistance to penicillin results from mutations in its
penicillin-binding proteins. In uncomplicated pneumococcal pneumonia caused by
a sensitive strain, treatment with penicillin is usually successful unless started too
late. Erythromycin or clindamycin can be used to treat pneumonia in patients
hypersensitive to penicillin, but strains resistant to these drugs exist. Broad
spectrum antibiotics (e.g., the tetracyclines) may also be effective, although
tetracycline-resistant strains are not rare. In spite of the availability of antibiotics,
the mortality of pneumococcal bacteremia in the last four decades has remained
stable between 25 and 29%. (Gillespie, S.H., et al., J. Med. Microbiol. 28:231-248
(1989)).

S. pneumoniae is carried in the upper respiratory tract by many healthy
individuals. It has been suggested that attachment of pneumococci is mediated by a
disaccharide receptor on fibronectin, present on human pharyngeal epithelial cells.
(Anderson, B.J., et al., J. Immunol. 142:2464-2468 (1989)). The mechanisms by
which pneumococci translocate from the nasopharynx to the lung, thereby causing
pneumonia, or migrate to the blood, giving rise to bacteremia or septicemia, are
poorly understood. (Johnston, R.B., et al., Rev. Infect. Dis. 13(Suppl. 6):S509-517
(1991)).

Various proteins have been suggested to be involved in the pathogenicity of
S. pneumoniae; however, only a few of them have actually been confirmed as
virulence factors. Pneumococci produce an IgA1 protease that might interfere with
host defense at mucosal surfaces. (Kornfield, S.J., et al., Rev. Inf. Dis. 3:521-534
(1981)). S. pneumoniae also produces neuraminidase, an enzyme that may
facilitate attachment to epithelial cells by cleaving sialic acid from the host
glycolipids and gangliosides. Partially purified neuraminidase was observed to
induce meningitis-like symptoms in mice; however, the reliability of this finding
has been questioned because the neuraminidase preparations used were probably
contaminated with cell wall products. Other pneumococcal proteins besides
neuraminidase are involved in the adhesion of pneumococci to epithelial and
endothelial cells. These pneumococcal proteins have as yet not been identified.

Recently, Cundell et al. reported that peptide permeases can modulate
pneumococcal adherence to epithelial and endothelial cells. It was, however,
unclear whether these permeases function directly as adhesins or whether they
enhance adherence by modulating the expression of pneumococcal adhesins.
(DeVelasco, E.A., et al., Micro. Rev. 59:591-603 (1995)). A better understanding
of the virulence factors determining its pathogenicity will need to be developed to
cope with the devastating effects of pneumococcal disease in humans.

Ironically, despite the prominent role of S. pneumoniae in the discovery of
DNA, little is known about the molecular genetics of the organism. The S.
pneumoniae genome consists of one circular, covalently closed, double-stranded
DNA and a collection of so-called variable accessory elements, such as prophages,
plasmids, transposons and the like. Most physical characteristics and almost all of
the genes of S. pneumoniae are unknown. Among the few that have been
identified, most have not been physically mapped or characterized in detail. Only a
few genes of this organism have been sequenced. (See, for instance, current
versions of GENBANK and other nucleic acid databases, and references that relate
to the genome of S. pneumoniae such as those set out elsewhere herein.)

It is clear that the etiology of diseases mediated or exacerbated by S.
pneumoniae infection involves the programmed expression of S. pneumoniae
genes, and that characterizing the genes and their patterns of expression would add
dramatically to our understanding of the organism and its host interactions.
Knowledge of S. pneumoniae genes and genomic organization would improve our
understanding of disease etiology and lead to improved and new ways of
preventing, ameliorating, arresting and reversing diseases. Moreover,
characterized genes and genomic fragments of S. pneumoniae would provide
reagents for, among other things, detecting, characterizing and controlling S.
pneumoniae infections. There is a need to characterize the genome of S.
pneumoniae and for polynucleotides of this organism.

SUMMARY OF THE INVENTION

The present invention is based on the sequencing of fragments of the
Streptococcus pneumoniae genome. The primary nucleotide sequences which were
generated are provided in SEQ ID NOS: 1-391.

The present invention provides the nucleotide sequence of several hundred
contigs of the Streptococcus pneumoniae genome, which are listed in tables below
and set out in the Sequence Listing submitted herewith, and representative
fragments thereof, in a form which can be readily used, analyzed, and interpreted
by a skilled artisan. In one embodiment, the present invention is provided as
contiguous strings of primary sequence information corresponding to the
nucleotide sequences depicted in SEQ ID NOS: 1-391.

The present invention further provides nucleotide sequences which are at
least 95% identical to the nucleotide sequences of SEQ ID NOS: 1-391.

The nucleotide sequence of SEQ ID NOS: 1-391, a representative fragment
thereof, or a nucleotide sequence which is at least 95% identical to the nucleotide
sequence of SEQ ID NOS: 1-391 may be provided in a variety of mediums to
facilitate its use. In one application of this embodiment, the sequences of the
present invention are recorded on computer readable media. Such media include,
but are not limited to: magnetic storage media, such as floppy discs, hard disc
storage medium, and magnetic tape; optical storage media such as CD-ROM;
electrical storage media such as RAM and ROM; and hybrids of these categories
such as magnetic/optical storage media.

The present invention further provides systems, particularly computer-based
systems, which contain the sequence information herein described stored in a
data storage means. Such systems are designed to identify commercially important
fragments of the Streptococcus pneumoniae genome.

Another embodiment of the present invention is directed to fragments of the
Streptococcus pneumoniae genome having particular structural or functional
attributes. Such fragments of the Streptococcus pneumoniae genome of the present
invention include, but are not limited to, fragments which encode peptides,
hereinafter referred to as open reading frames or ORFs, fragments which modulate
the expression of an operably linked ORF, hereinafter referred to as expression
modulating fragments or EMFs, and fragments which can be used to diagnose the
presence of Streptococcus pneumoniae in a sample, hereinafter referred to as
diagnostic fragments or DFs.
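
As a purely illustrative aid, the notion of an open reading frame can be
sketched in a few lines of Python. The sketch below is not the zorf or GeneMark
program referred to elsewhere herein; it assumes, for simplicity, that an ORF
runs from an ATG codon to the first in-frame stop codon on the forward strand
only, and the function name and the min_codons parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 30) -> list[tuple[int, int]]:
    """Return (start, end) pairs, 0-based and end-exclusive, for stretches
    that begin with ATG and end with the first in-frame stop codon.

    Simplifying assumptions: forward strand only, ATG as the only start
    codon, and the stop codon is counted in the ORF length.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the currently open ATG, if any
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                if (pos + 3 - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                start = None
    return orfs
```

A fuller treatment would also scan the reverse complement and consider
alternative bacterial start codons; those refinements are omitted here.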

Each of the ORFs in fragments of the Streptococcus pneumoniae genome
disclosed in Tables 1-3, and the EMFs found 5' to the ORFs, can be used in
numerous ways as polynucleotide reagents. For instance, the sequences can be
used as diagnostic probes or amplification primers for detecting or determining the
presence of a specific microbe in a sample, to selectively control gene expression in
a host and in the production of polypeptides, such as polypeptides encoded by
ORFs of the present invention, particularly those polypeptides that have a
pharmacological activity.

The present invention further includes recombinant constructs comprising
one or more fragments of the Streptococcus pneumoniae genome of the present
invention. The recombinant constructs of the present invention comprise vectors,
such as a plasmid or viral vector, into which a fragment of the Streptococcus
pneumoniae genome has been inserted.

The present invention further provides host cells containing any of the
isolated fragments of the Streptococcus pneumoniae genome of the present
invention. The host cells can be a higher eukaryotic host cell, such as a mammalian
cell, a lower eukaryotic cell, such as a yeast cell, or a procaryotic cell such as a
bacterial cell.

The present invention is further directed to isolated polypeptides and
proteins encoded by ORFs of the present invention. A variety of methods, well
known to those of skill in the art, routinely may be utilized to obtain any of the
polypeptides and proteins of the present invention. For instance, polypeptides and
proteins of the present invention having relatively short, simple amino acid
sequences readily can be synthesized using commercially available automated
peptide synthesizers. Polypeptides and proteins of the present invention also may
be purified from bacterial cells which naturally produce the protein. Yet another
alternative is to purify polypeptides and proteins of the present invention from cells
which have been altered to express them.

The invention further provides methods of obtaining homologs of the
fragments of the Streptococcus pneumoniae genome of the present invention and
homologs of the proteins encoded by the ORFs of the present invention.
Specifically, by using the nucleotide and amino acid sequences disclosed herein as
a probe or as primers, and techniques such as PCR cloning and colony/plaque
hybridization, one skilled in the art can obtain homologs.

The invention further provides antibodies which selectively bind
polypeptides and proteins of the present invention. Such antibodies include both
monoclonal and polyclonal antibodies.

The invention further provides hybridomas which produce the above-described
antibodies. A hybridoma is an immortalized cell line which is capable of
secreting a specific monoclonal antibody.

The present invention further provides methods of identifying test samples
derived from cells which express one of the ORFs of the present invention, or a
homolog thereof. Such methods comprise incubating a test sample with one or
more of the antibodies of the present invention, or one or more of the DFs of the
present invention, under conditions which allow a skilled artisan to determine if the
sample contains the ORF or product produced therefrom.

In another embodiment of the present invention, kits are provided which
contain the necessary reagents to carry out the above-described assays.

Specifically, the invention provides a compartmentalized kit to receive, in
close confinement, one or more containers which comprises: (a) a first container
comprising one of the antibodies, or one of the DFs of the present invention; and
(b) one or more other containers comprising one or more of the following: wash
reagents, and reagents capable of detecting the presence of bound antibodies or
hybridized DFs.

Using the isolated proteins of the present invention, the present invention
further provides methods of obtaining and identifying agents capable of binding to
a polypeptide or protein encoded by one of the ORFs of the present invention.
Specifically, such agents include, as further described below, antibodies, peptides,
carbohydrates, pharmaceutical agents and the like. Such methods comprise steps
of: (a) contacting an agent with an isolated protein encoded by one of the ORFs of
the present invention; and (b) determining whether the agent binds to said protein.

The present genomic sequences of Streptococcus pneumoniae will be of
great value to all laboratories working with this organism and for a variety of
commercial purposes. Many fragments of the Streptococcus pneumoniae genome
will be immediately identified by similarity searches against GenBank or protein
databases and will be of immediate value to Streptococcus pneumoniae researchers
and of immediate commercial value for the production of proteins or to control
gene expression.

The methodology and technology for elucidating extensive genomic
sequences of bacterial and other genomes have enhanced and will greatly enhance
the ability to analyze and understand chromosomal organization. In particular,
sequenced contigs and genomes will provide the models for developing tools for
the analysis of chromosome structure and function, including the ability to identify
genes within large segments of genomic DNA, the structure, position, and spacing
of regulatory elements, the identification of genes with potential industrial
applications, and the ability to do comparative genomics and molecular phylogeny.

DESCRIPTION OF THE FIGURES 

FIGURE 1 is a block diagram of a computer system (102) that can be
used to implement computer-based systems of the present invention.

FIGURE 2 is a schematic diagram depicting the data flow and computer
programs used to collect, assemble, edit and annotate the contigs of the
Streptococcus pneumoniae genome of the present invention. Both Macintosh and
Unix platforms are used to handle the AB 373 and 377 sequence data files, largely
as described in Kerlavage et al., Proceedings of the Twenty-Sixth Annual Hawaii
International Conference on System Sciences, 585, IEEE Computer Society Press,
Washington D.C. (1993). Factura (AB) is a Macintosh program designed for
automatic vector sequence removal and end-trimming of sequence files. The
program Loadis runs on a Macintosh platform and parses the feature data extracted
from the sequence files by Factura to the Unix based Streptococcus pneumoniae
relational database. Assembly of contigs (and whole genome sequences) is
accomplished by retrieving a specific set of sequence files and their associated
features using Extrseq, a Unix utility for retrieving sequences from an SQL
database. The resulting sequence file is processed by seq_filter to trim portions of
the sequences with more than 2% ambiguous nucleotides. The sequence files were
assembled using TIGR Assembler, an assembly engine designed at The Institute
for Genomic Research (TIGR) for rapid and accurate assembly of thousands of
sequence fragments. The collection of contigs generated by the assembly step is
loaded into the database with the lassie program. Identification of open reading
frames (ORFs) is accomplished by processing contigs with zorf or GeneMark. The
ORFs are searched against S. pneumoniae sequences from GenBank and against all
protein sequences using the BLASTN and BLASTP programs, described in
Altschul et al., J. Mol. Biol. 215: 403-410 (1990). Results of the ORF
determination and similarity searching steps were loaded into the database. As
described below, some results of the determination and the searches are set out in
Tables 1-3.
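
By way of illustration only, the 2% ambiguity criterion applied in the
trimming step can be sketched as follows. The internal behavior of seq_filter
is not set out herein, so the sketch assumes one plausible reading: a read is
shortened from its 3' end until no more than 2% of the remaining bases are
ambiguous (anything other than A, C, G or T). The function names are
hypothetical.

```python
def ambiguous_fraction(seq: str) -> float:
    """Fraction of bases that are not an unambiguous A, C, G or T."""
    if not seq:
        return 0.0
    return sum(base not in "ACGT" for base in seq.upper()) / len(seq)


def trim_read(seq: str, max_ambiguous: float = 0.02) -> str:
    """Shorten the read from its 3' end until the threshold is met.

    Assumption: ambiguity accumulates toward the 3' end of a read, so
    trimming from that end is a reasonable stand-in for the real filter.
    """
    end = len(seq)
    while end > 0 and ambiguous_fraction(seq[:end]) > max_ambiguous:
        end -= 1
    return seq[:end]
```

For example, a read of 100 unambiguous bases followed by five N calls would be
cut back to 102 bases, at which point two ambiguous calls in 102 bases fall
just under the 2% limit.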

DETAILED DESCRIPTION OF ILLUSTRATIVE EMBODIMENTS

The present invention is based on the sequencing of fragments of the
Streptococcus pneumoniae genome and analysis of the sequences. The primary
nucleotide sequences generated by sequencing the fragments are provided in SEQ
ID NOS: 1-391. (As used herein, the "primary sequence" refers to the nucleotide
sequence represented by the IUPAC nomenclature system.)

In addition to the aforementioned Streptococcus pneumoniae polynucleotides
and polynucleotide sequences, the present invention provides the nucleotide
sequences of SEQ ID NOS: 1-391, or representative fragments thereof, in a form
which can be readily used, analyzed, and interpreted by a skilled artisan.

As used herein, a "representative fragment of the nucleotide sequence
depicted in SEQ ID NOS: 1-391" refers to any portion of the SEQ ID NOS: 1-391
which is not presently represented within a publicly available database. Preferred
representative fragments of the present invention are Streptococcus pneumoniae
open reading frames (ORFs), expression modulating fragments (EMFs) and
fragments which can be used to diagnose the presence of Streptococcus
pneumoniae in a sample (DFs). A non-limiting identification of preferred
representative fragments is provided in Tables 1-3. As discussed in detail below,
the information provided in SEQ ID NOS: 1-391 and in Tables 1-3 together with
routine cloning, synthesis, sequencing and assay methods will enable those skilled
in the art to clone and sequence all "representative fragments" of interest, including
open reading frames encoding a large variety of Streptococcus pneumoniae
proteins.

While the presently disclosed sequences of SEQ ID NOS: 1-391 are highly
accurate, sequencing techniques are not perfect and, in relatively rare instances,
further investigation of a fragment or sequence of the invention may reveal a
nucleotide sequence error present in a nucleotide sequence disclosed in SEQ ID
NOS: 1-391. However, once the present invention is made available (i.e., once the
information in SEQ ID NOS: 1-391 and Tables 1-3 has been made available),
resolving a rare sequencing error in SEQ ID NOS: 1-391 will be well within the
skill of the art. The present disclosure makes available sufficient sequence
information to allow any of the described contigs or portions thereof to be obtained
readily by straightforward application of routine techniques. Further sequencing of
such polynucleotides may proceed in like manner using manual and automated
sequencing methods which are employed ubiquitously in the art. Nucleotide
sequence editing software is publicly available. For example, Applied Biosystems'
(AB) AutoAssembler can be used as an aid during visual inspection of nucleotide
sequences. By employing such routine techniques potential errors readily may be
identified and the correct sequence then may be ascertained by targeting further
sequencing effort, also of a routine nature, to the region containing the potential
error.

Even if all of the very rare sequencing errors in SEQ ID NOS: 1-391 were 
corrected, the resulting nucleotide sequences would still be at least 95% identical, 
nearly all would be at least 99% identical, and the great majority would be at least 
99.9% identical to the nucleotide sequences of SEQ ID NOS: 1-391. 

WO 98/18931                                                    PCT/US97/19588

As discussed elsewhere herein, polynucleotides of the present invention
readily may be obtained by routine application of well known and standard
procedures for cloning and sequencing DNA. Detailed methods for obtaining
libraries and for sequencing are provided below, for instance. A wide variety of
Streptococcus pneumoniae strains that can be used to prepare S. pneumoniae
genomic DNA for cloning and for obtaining polynucleotides of the present
invention are available to the public from recognized depository institutions, such
as the American Type Culture Collection (ATCC). While the present invention is
enabled by the sequences and other information herein disclosed, the S.
pneumoniae strain that provided the DNA of the present Sequence Listing, Strain
7/87 14.8.91, has been deposited in the ATCC, as a convenience to those of skill
in the art. As a further convenience, a library of S. pneumoniae genomic DNA,
derived from the same strain, also has been deposited in the ATCC. The S.
pneumoniae strain was deposited on October 10, 1996, and was given Deposit No.
55840, and the cDNA library was deposited on October 11, 1996 and was given
Deposit No. 97755. The genomic fragments in the library are 15 to 20 kb
fragments generated by partial Sau3AI digestion and they are inserted into the
BamHI site in the well-known lambda-derived vector lambda DASH II (Stratagene,
La Jolla, CA). The provision of the deposits is not a waiver of any rights of the
inventors or their assignees in the present subject matter.

The nucleotide sequences of the genomes from different strains of
Streptococcus pneumoniae differ somewhat. However, the nucleotide sequences
of the genomes of all Streptococcus pneumoniae strains will be at least 95%
identical, in corresponding part, to the nucleotide sequences provided in SEQ ID
NOS: 1-391. Nearly all will be at least 99% identical and the great majority will be
99.9% identical.

Thus, the present invention further provides nucleotide sequences which
are at least 95%, preferably 99% and most preferably 99.9% identical to the
nucleotide sequences of SEQ ID NOS: 1-391, in a form which can be readily used,
analyzed and interpreted by the skilled artisan.

Methods for determining whether a nucleotide sequence is at least 95%, at
least 99% or at least 99.9% identical to the nucleotide sequences of SEQ ID
NOS: 1-391 are routine and readily available to the skilled artisan. For example, the
well known FASTA algorithm described in Pearson and Lipman, Proc. Natl. Acad.
Sci. USA 85: 2444 (1988) can be used to generate the percent identity of nucleotide
sequences. The BLASTN program also can be used to generate an identity score
of polynucleotides compared to one another.
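
For illustration, once two sequences have been aligned by a program such as
those named above, the percent identity itself is a simple count. The sketch
below is not the FASTA or BLASTN implementation; it assumes the two sequences
have already been aligned to equal length with "-" marking gaps, and it counts
a gap opposite a base as a mismatch. The function name is hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of alignment columns in which both sequences carry the
    same base. Columns that are gaps in both sequences are ignored."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [
        (a, b)
        for a, b in zip(aligned_a.upper(), aligned_b.upper())
        if not (a == "-" and b == "-")
    ]
    if not columns:
        return 0.0
    matches = sum(a == b for a, b in columns)
    return 100.0 * matches / len(columns)
```

A sequence would then satisfy the 95%, 99% or 99.9% criterion when the value
returned for its alignment against a sequence of SEQ ID NOS: 1-391 is at least
95.0, 99.0 or 99.9, respectively. Different programs define the alignment
length differently, so reported identities may vary slightly between tools.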

COMPUTER RELATED EMBODIMENTS 

The nucleotide sequences provided in SEQ ID NOS: 1-391, a representative
fragment thereof, or a nucleotide sequence at least 95%, preferably at least 99%
and most preferably at least 99.9% identical to a polynucleotide sequence of SEQ
ID NOS: 1-391 may be "provided" in a variety of mediums to facilitate use thereof.
As used herein, "provided" refers to a manufacture, other than an isolated nucleic
acid molecule, which contains a nucleotide sequence of the present invention; i.e.,
a nucleotide sequence provided in SEQ ID NOS: 1-391, a representative fragment
thereof, or a nucleotide sequence at least 95%, preferably at least 99% and most
preferably at least 99.9% identical to a polynucleotide of SEQ ID NOS: 1-391.
Such a manufacture provides a large portion of the Streptococcus pneumoniae
genome and parts thereof (e.g., a Streptococcus pneumoniae open reading frame
(ORF)) in a form which allows a skilled artisan to examine the manufacture using
means not directly applicable to examining the Streptococcus pneumoniae genome
or a subset thereof as it exists in nature or in purified form.

In one application of this embodiment, a nucleotide sequence of the present
invention can be recorded on computer readable media. As used herein, "computer
readable media" refers to any medium which can be read and accessed directly by a
computer. Such media include, but are not limited to: magnetic storage media,
such as floppy discs, hard disc storage medium, and magnetic tape; optical storage
media such as CD-ROM; electrical storage media such as RAM and ROM; and
hybrids of these categories, such as magnetic/optical storage media. A skilled
artisan can readily appreciate how any of the presently known computer readable
mediums can be used to create a manufacture comprising computer readable
medium having recorded thereon a nucleotide sequence of the present invention.
Likewise, it will be clear to those of skill how additional computer readable media
that may be developed also can be used to create analogous manufactures having
recorded thereon a nucleotide sequence of the present invention.

As used herein, "recorded" refers to a process for storing information on
computer readable medium. A skilled artisan can readily adopt any of the presently
known methods for recording information on computer readable medium to generate
manufactures comprising the nucleotide sequence information of the present
invention. A variety of data storage structures are available to a skilled artisan
for creating a computer readable medium having recorded thereon a nucleotide
sequence of the present invention. The choice of the data storage structure will
generally be based on the means chosen to access the stored information. In
addition, a variety of data processor programs and formats can be used to store the
nucleotide sequence information of the present invention on computer readable
medium. The sequence information can be represented in a word processing text
file, formatted in commercially-available software such as WordPerfect and
Microsoft Word, or represented in the form of an ASCII file, stored in a database
application, such as DB2, Sybase, Oracle, or the like. A skilled artisan can readily
adapt any number of data-processor structuring formats (e.g., text file or database)
in order to obtain computer readable medium having recorded thereon the
nucleotide sequence information of the present invention.
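
The two storage options mentioned above, an ASCII text file and a database
application, can be sketched as follows. This is a minimal illustration under
stated assumptions: the FASTA layout is used as one common ASCII
representation, Python's built-in sqlite3 module stands in for a database
application such as DB2, Sybase or Oracle, and the table name, column names and
function names are hypothetical.

```python
import sqlite3


def format_fasta(records: dict[str, str], width: int = 60) -> str:
    """Render named sequences as FASTA text, wrapping each sequence
    at the given line width."""
    lines = []
    for name, seq in records.items():
        lines.append(f">{name}")
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


def store_records(conn: sqlite3.Connection, records: dict[str, str]) -> None:
    """Record named sequences in a relational table (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS contig ("
        "name TEXT PRIMARY KEY, sequence TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO contig (name, sequence) VALUES (?, ?)",
        records.items(),
    )
    conn.commit()
```

The text returned by format_fasta can be written to any computer readable
medium as an ordinary file, while store_records suits the case in which the
stored information is to be accessed by query, consistent with the observation
above that the choice of storage structure follows from the means of access.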

Computer software is publicly available which allows a skilled artisan to
access sequence information provided in a computer readable medium. Thus, by
providing in computer readable form the nucleotide sequences of SEQ ID NOS:
1-391, a representative fragment thereof, or a nucleotide sequence at least 95%,
preferably at least 99% and most preferably at least 99.9% identical to a sequence
of SEQ ID NOS: 1-391, the present invention enables the skilled artisan routinely to
access the provided sequence information for a wide variety of purposes.

The examples which follow demonstrate how software which implements
the BLAST (Altschul et al., J. Mol. Biol. 215:403-410 (1990)) and BLAZE
(Brutlag et al., Comp. Chem. 17:203-207 (1993)) search algorithms on a Sybase
system was used to identify open reading frames (ORFs) within the Streptococcus
pneumoniae genome which contain homology to ORFs or proteins from both
Streptococcus pneumoniae and from other organisms. Among the ORFs discussed
herein are protein encoding fragments of the Streptococcus pneumoniae genome
useful in producing commercially important proteins, such as enzymes used in
fermentation reactions and in the production of commercially useful metabolites.

The present invention further provides systems, particularly computer-based
systems, which contain the sequence information described herein. Such
systems are designed to identify, among other things, commercially important
fragments of the Streptococcus pneumoniae genome.

As used herein, "a computer-based system" refers to the hardware means,
software means, and data storage means used to analyze the nucleotide sequence
information of the present invention. The minimum hardware means of the
computer-based systems of the present invention comprises a central processing
unit (CPU), input means, output means, and data storage means. A skilled artisan
can readily appreciate that any one of the currently available computer-based
systems is suitable for use in the present invention.

As stated above, the computer-based systems of the present invention
comprise a data storage means having stored therein a nucleotide sequence of the
present invention and the necessary hardware means and software means for
supporting and implementing a search means.

As used herein, "data storage means" refers to memory which can store
nucleotide sequence information of the present invention, or a memory access
means which can access manufactures having recorded thereon the nucleotide
sequence information of the present invention.

As used herein, "search means" refers to one or more programs which are
implemented on the computer-based system to compare a target sequence or target
structural motif with the sequence information stored within the data storage
means. Search means are used to identify fragments or regions of the present
genomic sequences which match a particular target sequence or target motif. A
variety of known algorithms are disclosed publicly and a variety of commercially
available software for conducting search means are and can be used in the
computer-based systems of the present invention. Examples of such software
include, but are not limited to, MacPattern (EMBL), BLASTN and BLASTX
(NCBIA). A skilled artisan can readily recognize that any one of the available
algorithms or implementing software packages for conducting homology searches
can be adapted for use in the present computer-based systems.
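
The role of a search means can be illustrated with a deliberately simple
sketch. It is not MacPattern, BLASTN or BLASTX; it assumes an ungapped,
position-by-position comparison of the target against every window of each
stored sequence, and the function name and the max_mismatches parameter are
hypothetical.

```python
def find_matches(
    target: str, sequences: dict[str, str], max_mismatches: int = 0
) -> list[tuple[str, int, int]]:
    """Return (sequence name, 0-based position, mismatch count) for every
    window of the stored sequences that matches the target within the
    allowed number of mismatches."""
    if not target:
        raise ValueError("target sequence must not be empty")
    target = target.upper()
    hits = []
    for name, seq in sequences.items():
        seq = seq.upper()
        for pos in range(len(seq) - len(target) + 1):
            window = seq[pos:pos + len(target)]
            mismatches = sum(a != b for a, b in zip(target, window))
            if mismatches <= max_mismatches:
                hits.append((name, pos, mismatches))
    return hits
```

Here the dictionary passed as sequences plays the part of the data storage
means. Production search programs use indexing and statistical scoring rather
than this exhaustive scan, which is shown only because it makes the comparison
step explicit.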

As used herein, a "target sequence" can be any DNA or amino acid
sequence of six or more nucleotides or two or more amino acids. A skilled artisan
can readily recognize that the longer a target sequence is, the less likely a target
sequence will be present as a random occurrence in the database. The most
preferred sequence length of a target sequence is from about 10 to 100 amino acids
or from about 30 to 300 nucleotide residues. However, it is well recognized that
searches for commercially important fragments, such as sequence fragments
involved in gene expression and protein processing, may be of shorter length.
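
The relationship between target length and chance occurrence can be made
concrete with a short calculation. The sketch below assumes a simplified model
in which every residue of the database is independent and equally likely, so
that a given target of length L is expected to occur by chance about
(N - L + 1) / k^L times in a database of N residues drawn from an alphabet of
k letters. The function name and the database size used in the example are
illustrative assumptions, not values taken from the present disclosure.

```python
def expected_random_hits(
    target_length: int, database_length: int, alphabet_size: int = 4
) -> float:
    """Expected number of chance occurrences of one specific target in a
    random sequence, assuming independent, equally frequent residues."""
    positions = max(database_length - target_length + 1, 0)
    return positions / alphabet_size ** target_length


# Illustrative database of two million nucleotides (an assumed figure).
for length in (6, 10, 30):
    print(length, expected_random_hits(length, 2_000_000))
```

Under this model a six-nucleotide target is expected to occur hundreds of
times by chance in such a database, whereas a thirty-nucleotide target is
expected essentially never, which is consistent with the preference stated
above for longer target sequences.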

As used herein, "a target structural motif," or "target motif," refers to any
rationally selected sequence or combination of sequences in which the sequence(s)
are chosen based on a three-dimensional configuration which is formed upon the
folding of the target motif. There are a variety of target motifs known in the art.
Protein target motifs include, but are not limited to, enzymic active sites and signal
sequences. Nucleic acid target motifs include, but are not limited to, promoter
sequences, hairpin structures and inducible expression elements (protein binding
sequences).

A variety of structural formats for the input and output means can be used
to input and output the information in the computer-based systems of the present
invention. A preferred format for an output means ranks fragments of the
Streptococcus pneumoniae genomic sequences possessing varying degrees of
homology to the target sequence or target motif. Such presentation provides a
skilled artisan with a ranking of sequences which contain various amounts of the
target sequence or target motif and identifies the degree of homology contained in
the identified fragment.
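
By way of non-limiting illustration, a ranked output of the kind described
above can be sketched as follows. The sketch scores every ungapped window of a
sequence against the target by percent identity and lists the best windows
first. It is a deliberately simple stand-in and is not the BLAST algorithm; the
function name and the example sequences in the note that follows are
hypothetical.

```python
def rank_windows(genome: str, target: str, top: int = 3):
    """Return (percent_identity, start) pairs for the best windows, best first.

    Each window has the same length as the target and is compared position by
    position without gaps. 'start' is a 0-based offset into 'genome'.
    """
    k = len(target)
    hits = []
    for start in range(len(genome) - k + 1):
        window = genome[start:start + k]
        matches = sum(a == b for a, b in zip(window, target))
        hits.append((100.0 * matches / k, start))
    # Highest identity first; ties are broken by the earlier start position.
    hits.sort(key=lambda hit: (-hit[0], hit[1]))
    return hits[:top]
```

For the hypothetical sequence "AAACGTACGTTT" and target "ACGT", the two exact
occurrences at offsets 2 and 6 are ranked ahead of all partial matches.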

A variety of comparing means can be used to compare a target sequence or
target motif with the data storage means to identify sequence fragments of the
Streptococcus pneumoniae genome. In the present examples, implementing
software which implements the BLAST and BLAZE algorithms, described in
Altschul et al., J. Mol. Biol. 215:403-410 (1990), is used to identify open reading
frames within the Streptococcus pneumoniae genome. A skilled artisan can readily
recognize that any one of the publicly available homology search programs can be
used as the search means for the computer-based systems of the present invention.
Of course, suitable proprietary systems that may be known to those of skill also
may be employed in this regard.

Figure 1 provides a block diagram of a computer system illustrative of
embodiments of this aspect of the present invention. The computer system 102
includes a processor 106 connected to a bus 104. Also connected to the bus 104
are a main memory 108 (preferably implemented as random access memory, RAM)
and a variety of secondary storage devices 110, such as a hard drive 112 and a
removable medium storage device 114. The removable medium storage device 114
may represent, for example, a floppy disk drive, a CD-ROM drive, a magnetic tape
drive, etc. A removable storage medium 116 (such as a floppy disk, a compact
disk, a magnetic tape, etc.) containing control logic and/or data recorded therein
may be inserted into the removable medium storage device 114. The computer
system 102 includes appropriate software for reading the control logic and/or the
data from the removable medium storage device 114, once it is inserted into the
removable medium storage device 114.

A nucleotide sequence of the present invention may be stored in a well
known manner in the main memory 108, any of the secondary storage devices 110,
and/or a removable storage medium 116. During execution, software for accessing
and processing the genomic sequence (such as search tools, comparing tools, etc.)
resides in main memory 108, in accordance with the requirements and operating
parameters of the operating system, the hardware system and the software program
or programs.

BIOCHEMICAL EMBODIMENTS

Other embodiments of the present invention are directed to isolated
fragments of the Streptococcus pneumoniae genome. The fragments of the
Streptococcus pneumoniae genome of the present invention include, but are not
limited to, fragments which encode peptides and polypeptides, hereinafter open
reading frames (ORFs), fragments which modulate the expression of an operably
linked ORF, hereinafter expression modulating fragments (EMFs), and fragments
which can be used to diagnose the presence of Streptococcus pneumoniae in a
sample, hereinafter diagnostic fragments (DFs).

As used herein, an "isolated nucleic acid molecule" or an "isolated fragment
of the Streptococcus pneumoniae genome" refers to a nucleic acid molecule
possessing a specific nucleotide sequence which has been subjected to purification
means to reduce, from the composition, the number of compounds which are
normally associated with the composition. Particularly, the term refers to the
nucleic acid molecules having the sequences set out in SEQ ID NOS: 1-391, to
representative fragments thereof as described above, to polynucleotides at least
95%, preferably at least 99% and especially preferably at least 99.9% identical in
sequence thereto, also as set out above.

A variety of purification means can be used to generate the isolated
fragments of the present invention. These include, but are not limited to, methods
which separate constituents of a solution based on charge, solubility, or size.

In one embodiment, Streptococcus pneumoniae DNA can be enzymatically
sheared to produce fragments of 15-20 kb in length. These fragments can then be
used to generate a Streptococcus pneumoniae library by inserting them into lambda
clones as described in the Examples below. Primers flanking, for example, an
ORF, such as those enumerated in Tables 1-3, can then be generated using
nucleotide sequence information provided in SEQ ID NOS: 1-391. Well known
and routine techniques of PCR cloning then can be used to isolate the ORF from
the lambda DNA library or Streptococcus pneumoniae genomic DNA. Thus, given
the availability of SEQ ID NOS: 1-391, the information in Tables 1, 2 and 3, and
the information that may be obtained readily by analysis of the sequences of SEQ
ID NOS: 1-391 using methods set out above, those of skill will be enabled by the
present disclosure to isolate any ORF-containing or other nucleic acid fragment of
the present invention.

The isolated nucleic acid molecules of the present invention include, but are
not limited to, single stranded and double stranded DNA, and single stranded RNA.

As used herein, an "open reading frame," ORF, means a series of triplets
coding for amino acids without any termination codons and is a sequence
translatable into protein.

Tables 1, 2, and 3 list ORFs in the Streptococcus pneumoniae genomic
contigs of the present invention that were identified as putative coding regions by
the GeneMark software using organism-specific second-order Markov probability
transition matrices. It will be appreciated that other criteria can be used, in
accordance with well known analytical methods, such as those discussed herein, to
generate more inclusive, more restrictive, or more selective lists.

Table 1 sets out ORFs in the Streptococcus pneumoniae contigs of the
present invention that over a continuous region of at least 50 bases are 95% or
more identical (by BLAST analysis) to a nucleotide sequence available through
GenBank in October, 1997.

Table 2 sets out ORFs in the Streptococcus pneumoniae contigs of the
present invention that are not in Table 1 and match, with a BLASTP probability
score of 0.01 or less, a polypeptide sequence available through GenBank in
October, 1997.

Table 3 sets out ORFs in the Streptococcus pneumoniae contigs of the 
present invention that do not match significantly, by BLASTP analysis, a 
polypeptide sequence available through GenBank in October, 1997. 

In each table, the first and second columns identify the ORF by,
respectively, contig number and ORF number within the contig; the third column
indicates the first nucleotide of the ORF (actually the first nucleotide of the stop
codon immediately preceding the ORF), counting from the 5' end of the contig
strand; and the fourth column, "stop (nt)", indicates the last nucleotide of the stop
codon defining the 3' end of the ORF.

In Tables 1 and 2, column five lists the Reference for the closest
matching sequence available through GenBank. These reference numbers are the
database entry numbers commonly used by those of skill in the art, who will be
familiar with their denominators. Descriptions of the nomenclature are available
from the National Center for Biotechnology Information. Column six in Tables 1
and 2 provides the gene name of the matching sequence; column seven provides
the BLAST identity score and column eight the BLAST similarity score from the
comparison of the ORF and the homologous gene; and column nine indicates the
length in nucleotides of the highest scoring segment pair identified by the BLAST
identity analysis.

Each ORF described in the tables is defined by "start (nt)" (5') and "stop
(nt)" (3') nucleotide position numbers. These position numbers refer to the
boundaries of each ORF and provide orientation with respect to whether the
forward or reverse strand is the coding strand and in which reading frame the
coding sequence is contained. The "start" position is the first nucleotide of the triplet
encoding a stop codon just 5' to the ORF and the "stop" position is the last
nucleotide of the triplet encoding the next in-frame stop codon (i.e., the stop codon
at the 3' end of the ORF). Those of ordinary skill in the art appreciate that
preferred fragments within each ORF described in the table include fragments of
each ORF which include the entire sequence from the delineated "start" and "stop"
positions excepting the first and last three nucleotides since these encode stop
codons. Thus, polynucleotides set out as ORFs in the tables but lacking the three
(3) 5' nucleotides and the three (3) 3' nucleotides are encompassed by the present
invention. Those of skill also appreciate that particularly preferred are fragments
within each ORF that are polynucleotide fragments comprising polypeptide coding
sequence. As defined herein, "coding sequence" includes the fragment within an
ORF beginning at the first in-frame ATG (triplet encoding methionine) and ending
with the last nucleotide prior to the triplet encoding the 3' stop codon. Preferred
are fragments comprising the entire coding sequence and fragments comprising the
entire coding sequence, excepting the coding sequence for the N-terminal
methionine. Those of skill appreciate that the N-terminal methionine is often
removed during post-translational processing and that polynucleotides lacking the
ATG can be used to facilitate production of N-terminal fusion proteins which may
be beneficial in the production or use of genetically engineered proteins. Of course,
due to the degeneracy of the genetic code many polynucleotides can encode a given
polypeptide. Thus, the invention further includes polynucleotides comprising a
nucleotide sequence encoding a polypeptide sequence itself encoded by the coding
sequence within an ORF described in Tables 1-3 herein. Further, polynucleotides
at least 95%, preferably at least 99% and especially preferably at least 99.9%
identical in sequence to the foregoing polynucleotides, are contemplated by the
present invention.

Polypeptides encoded by polynucleotides described above and elsewhere
herein are also provided by the present invention, as are polypeptides comprising
an amino acid sequence at least about 95%, preferably at least 97% and even more
preferably 99% identical to the amino acid sequence of a polypeptide encoded by an
ORF shown in Tables 1-3. These polypeptides may or may not comprise an N-
terminal methionine.
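
By way of non-limiting illustration, the "start" and "stop" convention set out
above for the tables, together with the definition of "coding sequence," can be
sketched as follows. The sketch assumes 1-based, inclusive positions and assumes
that a "start" greater than the "stop" denotes an ORF on the reverse strand;
that encoding of orientation is an assumption of this sketch and is not stated
in the tables. All function names and the example contig are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def orf_span(contig: str, start: int, stop: int) -> str:
    """Stop-to-stop span read 5' to 3' on the coding strand.

    Positions are 1-based and inclusive. start > stop is taken to mean the
    reverse strand is the coding strand (an assumption of this sketch).
    """
    if start <= stop:
        return contig[start - 1:stop]
    return reverse_complement(contig[stop - 1:start])


def orf_interior(contig: str, start: int, stop: int) -> str:
    """The span without its first and last three nucleotides (the stop codons)."""
    return orf_span(contig, start, stop)[3:-3]


def coding_sequence(contig: str, start: int, stop: int) -> str:
    """From the first in-frame ATG to the last nucleotide before the 3' stop."""
    interior = orf_interior(contig, start, stop)
    for i in range(0, len(interior) - 2, 3):
        if interior[i:i + 3] == "ATG":
            return interior[i:]
    return ""  # no in-frame ATG found


# Hypothetical contig: TAA | GCA ATG AAA GGG | TGA, flanked by CC on each side.
EXAMPLE_CONTIG = "CCTAAGCAATGAAAGGGTGACC"
EXAMPLE_CDS = coding_sequence(EXAMPLE_CONTIG, 3, 20)
```

Dropping the first three nucleotides of the returned coding sequence gives the
fragment lacking the codon for the N-terminal methionine.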

The concepts of percent identity and percent similarity of two polypeptide
sequences are well understood in the art. For example, two polypeptides 10 amino
acids in length which differ at three amino acid positions (e.g., at positions 1, 3
and 5) are said to have a percent identity of 70%. However, the same two
polypeptides would be deemed to have a percent similarity of 80% if, for example,
at position 5, the amino acid moieties, although not identical, were "similar" (i.e.,
possessed similar biochemical characteristics). Many programs for analysis of
nucleotide or amino acid sequence similarity, such as FASTA and BLAST, specifically
list percent identity of a matching region as an output parameter. Thus, for
instance, Tables 1 and 2 herein enumerate the percent identity of the highest
scoring segment pair in each ORF and its listed relative. Further details
concerning the algorithms and criteria used for homology searches are provided
below and are described in the pertinent literature highlighted by the citations
provided below.
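
By way of non-limiting illustration, the 70% identity and 80% similarity
example above can be reproduced as sketched below. The similarity groups are a
simplified, hypothetical grouping chosen only for this illustration; programs
such as BLAST instead use scoring matrices, and the two example peptides are
invented.

```python
# Hypothetical groups of biochemically similar residues (illustration only).
SIMILAR_GROUPS = ("ILVM", "FYW", "KRH", "DE", "ST", "NQ")


def percent_identity(a: str, b: str) -> float:
    """Percentage of aligned positions carrying identical residues."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def is_similar(x: str, y: str) -> bool:
    """True if residues are identical or fall in the same similarity group."""
    return x == y or any(x in group and y in group for group in SIMILAR_GROUPS)


def percent_similarity(a: str, b: str) -> float:
    """Percentage of aligned positions carrying identical or similar residues."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return 100.0 * sum(is_similar(x, y) for x, y in zip(a, b)) / len(a)


# Two invented 10-residue peptides differing at positions 1, 3 and 5;
# only the pair at position 5 (I versus L) falls within one similarity group.
PEPTIDE_A = "GKDAIWSTPE"
PEPTIDE_B = "WKKALWSTPE"
```

With these inputs, percent_identity gives 70.0 and percent_similarity gives
80.0, matching the figures in the paragraph above.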

It will be appreciated that other criteria can be used to generate more
inclusive and more exclusive listings of the types set out in the tables. As those of
skill will appreciate, narrow and broad searches both are useful. Thus, a skilled
artisan can readily identify ORFs in contigs of the Streptococcus pneumoniae
genome other than those listed in Tables 1-3, such as ORFs which are overlapping
or encoded by the opposite strand of an identified ORF in addition to those
ascertainable using the computer-based systems of the present invention.

As used herein, an "expression modulating fragment," EMF, means a
series of nucleotide molecules which modulates the expression of an operably
linked ORF or EMF.

As used herein, a sequence is said to "modulate the expression of an
operably linked sequence" when the expression of the sequence is altered by the
presence of the EMF. EMFs include, but are not limited to, promoters, and
promoter modulating sequences (inducible elements). One class of EMFs comprises
fragments which induce the expression of an operably linked ORF in response to a
specific regulatory factor or physiological event.

EMF sequences can be identified within the contigs of the Streptococcus
pneumoniae genome by their proximity to the ORFs provided in Tables 1-3. An
intergenic segment, or a fragment of the intergenic segment, from about 10 to 200
nucleotides in length, taken from any one of the ORFs of Tables 1-3 will modulate
the expression of an operably linked ORF in a fashion similar to that found with the
naturally linked ORF sequence. As used herein, an "intergenic segment" refers to
fragments of the Streptococcus pneumoniae genome which are between two
ORFs herein described. EMFs also can be identified using known EMFs as a
target sequence or target motif in the computer-based systems of the present
invention. Further, the two methods can be combined and used together.
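
By way of non-limiting illustration, candidate intergenic segments can be
located from ORF positions as sketched below. The sketch assumes 1-based,
inclusive ORF coordinates on a single contig and reports only whole gaps of
about 10 to 200 nucleotides; fragments of longer gaps, which also fall within
the description above, are not enumerated. The function name and the example
coordinates in the note that follows are hypothetical.

```python
def intergenic_segments(orfs, min_len: int = 10, max_len: int = 200):
    """Return (left, right) gaps between neighbouring ORFs on one contig.

    'orfs' is a list of (start, stop) pairs, 1-based and inclusive; a pair may
    be given in either order, so reverse-strand ORFs are accepted.
    """
    ordered = sorted((min(a, b), max(a, b)) for a, b in orfs)
    if not ordered:
        return []
    gaps = []
    covered_to = ordered[0][1]  # rightmost position covered by an ORF so far
    for left, right in ordered[1:]:
        gap_left, gap_right = covered_to + 1, left - 1
        length = gap_right - gap_left + 1
        if min_len <= length <= max_len:
            gaps.append((gap_left, gap_right))
        # Overlapping ORFs must not move the covered region backwards.
        covered_to = max(covered_to, right)
    return gaps
```

For hypothetical ORFs at (1, 300), (351, 900), (905, 1500) and (2000, 2600),
only the 50-nucleotide gap at positions 301 to 350 is returned; the
4-nucleotide and 499-nucleotide gaps fall outside the length limits.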

The presence and activity of an EMF can be confirmed using an EMF trap
vector. An EMF trap vector contains a cloning site linked to a marker sequence. A
marker sequence encodes an identifiable phenotype, such as antibiotic resistance or
a complementing nutrition auxotrophic factor, which can be identified or assayed
when the EMF trap vector is placed within an appropriate host under appropriate
conditions. As described above, an EMF will modulate the expression of an
operably linked marker sequence. A more detailed discussion of various marker
sequences is provided below. A sequence which is suspected as being an EMF is
cloned in all three reading frames in one or more restriction sites upstream from the
marker sequence in the EMF trap vector. The vector is then transformed into an
appropriate host using known procedures and the phenotype of the transformed
host is examined under appropriate conditions. As described above, an EMF will
modulate the expression of an operably linked marker sequence.

As used herein, a "diagnostic fragment," DF, means a series of nucleotide
molecules which selectively hybridize to Streptococcus pneumoniae sequences.
DFs can be readily identified by identifying unique sequences within contigs of the
Streptococcus pneumoniae genome, such as by using well-known computer
analysis software, and by generating and testing probes or amplification primers
consisting of the DF sequence in an appropriate diagnostic format which
determines amplification or hybridization selectivity.

The sequences falling within the scope of the present invention are not
limited to the specific sequences herein described, but also include allelic and
species variations thereof. Allelic and species variations can be routinely
determined by comparing the sequences provided in SEQ ID NOS: 1-391, a
representative fragment thereof, or a nucleotide sequence at least 95%, preferably
at least 99% and most preferably at least 99.9% identical to SEQ ID NOS: 1-391,
with a sequence from another isolate of the same species. Furthermore, to
accommodate codon variability, the invention includes nucleic acid molecules
coding for the same amino acid sequences as do the specific ORFs disclosed
herein. In other words, in the coding region of an ORF, substitution of one codon
for another which encodes the same amino acid is expressly contemplated. Any
specific sequence disclosed herein can be readily screened for errors by
resequencing a particular fragment, such as an ORF, in both directions (i.e.,
sequencing both strands). Alternatively, error screening can be performed by
sequencing corresponding polynucleotides of Streptococcus pneumoniae origin
isolated by using part or all of the fragments in question as a probe or primer.

Preferred DFs of the present invention comprise at least about 17,
preferably at least about 20, and more preferably at least about 50 contiguous
nucleotides within an ORF set out in Tables 1-3. Most highly preferred DFs
specifically hybridize to a polynucleotide containing the sequence of the ORF from
which they are derived. Specific hybridization occurs even under stringent
conditions defined elsewhere herein.

Each of the ORFs of the Streptococcus pneumoniae genome disclosed in
Tables 1, 2 and 3, and the EMFs found 5' to the ORFs, can be used as
polynucleotide reagents in numerous ways. For example, the sequences can be
used as diagnostic probes or diagnostic amplification primers to detect the presence
of a specific microbe in a sample, particularly Streptococcus pneumoniae.
Especially preferred in this regard are ORFs such as those of Table 3, which do not
match previously characterized sequences from other organisms and thus are most
likely to be highly selective for Streptococcus pneumoniae. Also particularly
preferred are ORFs that can be used to distinguish between strains of Streptococcus
pneumoniae, particularly those that distinguish medically important strains, such as
drug-resistant strains.

In addition, the fragments of the present invention, as broadly described,
can be used to control gene expression through triple helix formation or antisense
DNA or RNA, both of which methods are based on the binding of a polynucleotide
sequence to DNA or RNA. Triple helix formation optimally results in a shut-off of
RNA transcription from DNA, while antisense RNA hybridization blocks
translation of an mRNA molecule into polypeptide. Information from the
sequences of the present invention can be used to design antisense and triple
helix-forming oligonucleotides. Polynucleotides suitable for use in these methods are
usually 20 to 40 bases in length and are designed to be complementary to a region
of the gene involved in transcription, for triple-helix formation, or to the mRNA
itself, for antisense inhibition. Both techniques have been demonstrated to be
effective in model systems, and the requisite techniques are well known and
involve routine procedures. Triple helix techniques are discussed in, for example,
Lee et al., Nucl. Acids Res. 6:3073 (1979); Cooney et al., Science 241:456
(1988); and Dervan et al., Science 251:1360 (1991). Antisense techniques in
general are discussed in, for instance, Okano, J. Neurochem. 56:560 (1991) and
Oligodeoxynucleotides as Antisense Inhibitors of Gene Expression, CRC Press,
Boca Raton, FL (1988).
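
By way of non-limiting illustration, an antisense oligonucleotide of the length
described above can be derived from an mRNA sequence as sketched below. The
sketch simply takes the reverse complement, as DNA, of a chosen 20 to 40 base
region; it does not consider target accessibility, melting temperature or
specificity, all of which would need separate evaluation. The function name,
the 0-based offset convention and the example mRNA are hypothetical.

```python
_DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def antisense_oligo(mrna: str, start: int, length: int = 20) -> str:
    """Return a DNA oligonucleotide complementary to mrna[start:start+length].

    'start' is a 0-based offset. The oligonucleotide is written 5' to 3', so
    it is the reverse complement of the targeted region.
    """
    if not 20 <= length <= 40:
        raise ValueError("length should be 20 to 40 bases")
    region = mrna[start:start + length]
    if start < 0 or len(region) < length:
        raise ValueError("region extends beyond the mRNA")
    as_dna = region.upper().replace("U", "T")
    return as_dna.translate(_DNA_COMPLEMENT)[::-1]


# Invented 27-nucleotide mRNA used only to exercise the function.
EXAMPLE_MRNA = "AUGGCUAAAGGUCCUGAAUUCGGAUCC"
EXAMPLE_OLIGO = antisense_oligo(EXAMPLE_MRNA, 0, 20)
```

Choosing a region that spans the translation initiation codon is one common
design choice, although the paragraph above does not prescribe one.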

The present invention further provides recombinant constructs comprising
one or more fragments of the Streptococcus pneumoniae genomic fragments and
contigs of the present invention. Certain preferred recombinant constructs of the
present invention comprise a vector, such as a plasmid or viral vector, into which a
fragment of the Streptococcus pneumoniae genome has been inserted, in a forward
or reverse orientation. In the case of a vector comprising one of the ORFs of the
present invention, the vector may further comprise regulatory sequences, including
for example, a promoter, operably linked to the ORF. For vectors comprising the
EMFs of the present invention, the vector may further comprise a marker sequence
or heterologous ORF operably linked to the EMF.

Large numbers of suitable vectors and promoters are known to those of
skill in the art and are commercially available for generating the recombinant
constructs of the present invention. The following vectors are provided by way of
example. Useful bacterial vectors include phagescript, PsiX174, pBluescript SK,
pBS KS, pNH8a, pNH16a, pNH18a, pNH46a (available from Stratagene);
pTrc99A, pKK223-3, pKK233-3, pDR540, pRIT5 (available from Pharmacia).
Useful eukaryotic vectors include pWLneo, pSV2cat, pOG44, pXT1, pSG
(available from Stratagene); pSVK3, pBPV, pMSG, pSVL (available from
Pharmacia).

Promoter regions can be selected from any desired gene using CAT
(chloramphenicol acetyltransferase) vectors or other vectors with selectable markers.
Two appropriate vectors are pKK232-8 and pCM7. Particular named bacterial
promoters include lacI, lacZ, T3, T7, gpt, lambda PR, and trc. Eukaryotic
promoters include CMV immediate early, HSV thymidine kinase, early and late
SV40, LTRs from retrovirus, and mouse metallothionein-I. Selection of the
appropriate vector and promoter is well within the level of ordinary skill in the art.

The present invention further provides host cells containing any one of the
isolated fragments of the Streptococcus pneumoniae genomic fragments and
contigs of the present invention, wherein the fragment has been introduced into the
host cell using known methods. The host cell can be a higher eukaryotic host
cell, such as a mammalian cell, a lower eukaryotic host cell, such as a yeast cell, or
a prokaryotic cell, such as a bacterial cell.

A polynucleotide of the present invention, such as a recombinant construct
comprising an ORF of the present invention, may be introduced into the host by a
variety of well established techniques that are standard in the art, such as calcium
phosphate transfection, DEAE-dextran mediated transfection and electroporation,
which are described in, for instance, Davis, L. et al., BASIC METHODS IN
MOLECULAR BIOLOGY (1986).

A host cell containing one of the fragments of the Streptococcus
pneumoniae genomic fragments and contigs of the present invention can be used
in conventional manners to produce the gene product encoded by the isolated
fragment (in the case of an ORF) or can be used to produce a heterologous protein
under the control of the EMF.

The present invention further provides
isolated polypeptides encoded by the nucleic acid fragments of the present
invention or by degenerate variants of the nucleic acid fragments of the present
invention. By "degenerate variant" is intended nucleotide fragments which differ
from a nucleic acid fragment of the present invention (e.g., an ORF) by nucleotide
sequence but, due to the degeneracy of the genetic code, encode an identical
polypeptide sequence.
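
By way of non-limiting illustration, whether one nucleotide fragment is a
"degenerate variant" of another can be checked as sketched below. The sketch
uses the standard genetic code, written in the customary T, C, A, G codon
order; the use of the standard code for Streptococcus pneumoniae is assumed
here rather than stated in the present disclosure. The function names and the
example sequences are hypothetical.

```python
from itertools import product

# Standard genetic code with codons enumerated in T, C, A, G order.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino
    for codon, amino in zip(product(_BASES, repeat=3), _AMINO)
}


def translate(seq: str) -> str:
    """Translate an upper-case DNA sequence read in frame from its first base.

    Stop codons are rendered as '*'; trailing bases short of a codon are ignored.
    """
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))


def is_degenerate_variant(a: str, b: str) -> bool:
    """True if the sequences differ but encode an identical polypeptide."""
    return a != b and translate(a) == translate(b)


# Two invented sequences that both encode Met-Lys-Gly.
EXAMPLE_PEPTIDE = translate("ATGAAAGGG")
EXAMPLE_IS_VARIANT = is_degenerate_variant("ATGAAAGGG", "ATGAAGGGC")
```

Substituting AAG for AAA and GGC for GGG changes the nucleotide sequence but
leaves the encoded lysine and glycine unchanged, so the second sequence is a
degenerate variant of the first.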

Preferred nucleic acid fragments of the present invention are the ORFs and
subfragments thereof depicted in Tables 2 and 3 which encode proteins.

A variety of methodologies known in the art can be utilized to obtain any
one of the isolated polypeptides or proteins of the present invention. At the
simplest level, the amino acid sequence can be synthesized using commercially
available peptide synthesizers. This is particularly useful in producing small
peptides and fragments of larger polypeptides. Such short fragments as may be
obtained most readily by synthesis are useful, for example, in generating antibodies
against the native polypeptide, as discussed further below.

In an alternative method, the polypeptide or protein is purified from
bacterial cells which naturally produce the polypeptide or protein. One skilled in
the art can readily employ well-known methods for isolating polypeptides and
proteins to isolate and purify polypeptides or proteins of the present invention
produced naturally by a bacterial strain, or by other methods. Methods for
isolation and purification that can be employed in this regard include, but are not
limited to, immunochromatography, HPLC, size-exclusion chromatography,
ion-exchange chromatography, and immuno-affinity chromatography.

The polypeptides and proteins of the present invention also can be purified
from cells which have been altered to express the desired polypeptide or protein.
As used herein, a cell is said to be altered to express a desired polypeptide or
protein when the cell, through genetic manipulation, is made to produce a
polypeptide or protein which it normally does not produce or which the cell
normally produces at a lower level. Those skilled in the art can readily adapt
procedures for introducing and expressing either recombinant or synthetic
sequences into eukaryotic or prokaryotic cells in order to generate a cell which
produces one of the polypeptides or proteins of the present invention.

Any host/vector system can be used to express one or more of the ORFs of
the present invention. These include, but are not limited to, eukaryotic hosts such
as HeLa cells, CV-1 cells, COS cells, and Sf9 cells, as well as prokaryotic hosts
such as E. coli and B. subtilis. The most preferred cells are those which do not
normally express the particular polypeptide or protein or which express the
polypeptide or protein at a low natural level.

"Recombinant," as used herein, means that a polypeptide or protein is
derived from recombinant (e.g., microbial or mammalian) expression systems.
"Microbial" refers to recombinant polypeptides or proteins made in bacterial or
fungal (e.g., yeast) expression systems. As a product, "recombinant
microbial" defines a polypeptide or protein essentially free of native endogenous
substances and unaccompanied by associated native glycosylation. Polypeptides or
proteins expressed in most bacterial cultures, e.g., E. coli, will be free of
glycosylation modifications; polypeptides or proteins expressed in yeast will have a
glycosylation pattern different from that expressed in mammalian cells.

"Nucleotide sequence" refers to a heteropolymer of deoxyribonucleotides.
Generally, DNA segments encoding the polypeptides and proteins provided by this
invention are assembled from fragments of the Streptococcus pneumoniae genome
and short oligonucleotide linkers, or from a series of oligonucleotides, to provide a
synthetic gene which is capable of being expressed in a recombinant transcriptional
unit comprising regulatory elements derived from a microbial or viral operon.

"Recombinant expression vehicle or vector" refers to a plasmid or phage or
virus or vector, for expressing a polypeptide from a DNA (RNA) sequence. The
expression vehicle can comprise a transcriptional unit comprising an assembly of
(1) genetic regulatory elements necessary for gene expression in the host,
including elements required to initiate and maintain transcription at a level sufficient
for suitable expression of the desired polypeptide, including, for example,
promoters and, where necessary, an enhancer and a polyadenylation signal; (2) a
structural or coding sequence which is transcribed into mRNA and translated into
protein; and (3) appropriate signals to initiate translation at the beginning of the
desired coding region and terminate translation at its end. Structural units intended
for use in yeast or eukaryotic expression systems preferably include a leader
sequence enabling extracellular secretion of translated protein by a host cell.
Alternatively, where recombinant protein is expressed without a leader or transport
sequence, it may include an N-terminal methionine residue. This residue may or
may not be subsequently cleaved from the expressed recombinant protein to
provide a final product.

"Recombinant expression system" means host cells which have stably
integrated a recombinant transcriptional unit into chromosomal DNA or carry the
recombinant transcriptional unit extrachromosomally. The cells can be prokaryotic
or eukaryotic. Recombinant expression systems as defined herein will express
heterologous polypeptides or proteins upon induction of the regulatory elements
linked to the DNA segment or synthetic gene to be expressed.

Mature proteins can be expressed in mammalian cells, yeast, bacteria, or
other cells under the control of appropriate promoters. Cell-free translation
systems can also be employed to produce such proteins using RNAs derived from
the DNA constructs of the present invention. Appropriate cloning and expression
vectors for use with prokaryotic and eukaryotic hosts are described in Sambrook et
al., Molecular Cloning: A Laboratory Manual, 2nd Edition, Cold Spring Harbor
Laboratory Press, Cold Spring Harbor, New York (1989), the disclosure of which
is hereby incorporated by reference in its entirety.

Generally, recombinant expression vectors will include origins of
replication and selectable markers permitting transformation of the host cell, e.g.,
the ampicillin resistance gene of E. coli and S. cerevisiae TRP1 gene, and a
promoter derived from a highly expressed gene to direct transcription of a
downstream structural sequence. Such promoters can be derived from operons
encoding glycolytic enzymes such as 3-phosphoglycerate kinase (PGK),
alpha-factor, acid phosphatase, or heat shock proteins, among others. The heterologous
structural sequence is assembled in appropriate phase with translation initiation and
termination sequences, and preferably, a leader sequence capable of directing
secretion of translated protein into the periplasmic space or extracellular medium.
Optionally, the heterologous sequence can encode a fusion protein including an
N-terminal identification peptide imparting desired characteristics, e.g., stabilization
or simplified purification of expressed recombinant product.

Useful expression vectors for bacterial use are constructed by inserting a
structural DNA sequence encoding a desired protein together with suitable
translation initiation and termination signals in operable reading phase with a
functional promoter. The vector will comprise one or more phenotypic selectable
markers and an origin of replication to ensure maintenance of the vector and, when
desirable, provide amplification within the host.

Suitable prokaryotic hosts for transformation include strains of E. coli, B.
subtilis, Salmonella typhimurium and various species within the genera
Pseudomonas and Streptomyces. Others may also be employed as a matter of
choice.

As a representative but non-limiting example, useful expression vectors for
bacterial use can comprise a selectable marker and bacterial origin of replication
derived from commercially available plasmids comprising genetic elements of the
well known cloning vector pBR322 (ATCC 37017). Such commercial vectors
include, for example, pKK223-3 (available from Pharmacia Fine Chemicals,
Uppsala, Sweden) and pGEM1 (available from Promega Biotec, Madison, WI,
USA). These pBR322 "backbone" sections are combined with an appropriate
promoter and the structural sequence to be expressed.

Following transformation of a suitable host strain and growth of the host
strain to an appropriate cell density, the selected promoter, where it is inducible, is
derepressed or induced by appropriate means (e.g., temperature shift or chemical
induction) and cells are cultured for an additional period to provide for expression
of the induced gene product. Thereafter cells are typically harvested, generally by
centrifugation, disrupted to release expressed protein, generally by physical or
chemical means, and the resulting crude extract is retained for further purification.

Various mammalian cell culture systems can also be employed to express
recombinant protein. Examples of mammalian expression systems include the
COS-7 lines of monkey kidney fibroblasts, described in Gluzman, Cell 23:115
(1981), and other cell lines capable of expressing a compatible vector, for example,
the C127, 3T3, CHO, HeLa and BHK cell lines.

Mammalian expression vectors will comprise an origin of replication, a
suitable promoter and enhancer, and also any necessary ribosome binding sites,
polyadenylation site, splice donor and acceptor sites, transcriptional termination
sequences, and 5' flanking nontranscribed sequences. DNA sequences derived
from the SV40 viral genome, for example, SV40 origin, early promoter, enhancer,
splice, and polyadenylation sites may be used to provide the required
nontranscribed genetic elements.

Recombinant polypeptides and proteins produced in bacterial culture are
usually isolated by initial extraction from cell pellets, followed by one or more
salting-out, aqueous ion exchange or size exclusion chromatography steps.
Microbial cells employed in expression of proteins can be disrupted by any
convenient method, including freeze-thaw cycling, sonication, mechanical
disruption, or use of cell lysing agents. Protein refolding steps can be used, as
necessary, in completing configuration of the mature protein. Finally, high
performance liquid chromatography (HPLC) can be employed for final purification
steps.

The present invention further includes isolated polypeptides, proteins and
nucleic acid molecules which are substantially equivalent to those herein described.
As used herein, substantially equivalent can refer both to nucleic acid and amino
acid sequences, for example a mutant sequence, that varies from a reference
sequence by one or more substitutions, deletions, or additions, the net effect of
which does not result in an adverse functional dissimilarity between reference and
subject sequences. For purposes of the present invention, sequences having
equivalent biological activity and equivalent expression characteristics are
considered substantially equivalent. For purposes of determining equivalence,
truncation of the mature sequence should be disregarded.

The invention further provides methods of obtaining homologs, from other
strains of Streptococcus pneumoniae, of the fragments of the Streptococcus
pneumoniae genome of the present invention and homologs of the proteins encoded
by the ORFs of the present invention. As used herein, a sequence or protein of
Streptococcus pneumoniae is defined as a homolog of one of the Streptococcus
pneumoniae fragments or contigs, or of a protein encoded by one of the
ORFs of the present invention, if it shares significant homology to one of the
fragments of the Streptococcus pneumoniae genome of the present invention or a
protein encoded by one of the ORFs of the present invention. Specifically, by
using the sequence disclosed herein as a probe or as primers, and techniques such
as PCR cloning and colony/plaque hybridization, one skilled in the art can obtain
homologs.

As used herein, two nucleic acid molecules or proteins are said to "share
significant homology" if the two contain regions which possess greater than 85%
sequence (amino acid or nucleic acid) homology. Preferred homologs in this
regard are those with more than 90% homology. Especially preferred are those
with 93% or more homology. Among especially preferred homologs, those with
95% or more homology are particularly preferred. Very particularly preferred
among these are those with 97%, and even more particularly preferred among those
are homologs with 99% or more homology. The most preferred homologs among
these are those with 99.9% homology or more. It will be understood that, among
measures of homology, identity is particularly preferred in this regard.
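
The homology tiers above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the use of a pre-computed pairwise alignment with "-" as the gap character, and the choice to count a gap opposite a residue as a mismatch are all hypothetical and are not part of the disclosure. The percentage cut-offs and labels are taken from the preceding paragraph, with "greater than" and "more than" treated as exclusive bounds and "or more" as inclusive.

```python
from typing import Optional

# Cut-offs quoted in the text, most stringent first.
# Each entry is (percent, inclusive?, label).
TIERS = [
    (99.9, True, "most preferred"),
    (99.0, True, "even more particularly preferred"),
    (97.0, True, "very particularly preferred"),
    (95.0, True, "particularly preferred"),
    (93.0, True, "especially preferred"),
    (90.0, False, "preferred"),
    (85.0, False, "shares significant homology"),
]


def percent_identity(a: str, b: str) -> float:
    """Percent identity of two pre-aligned, equal-length sequences.

    Assumption: '-' marks a gap; columns where both sequences have a gap
    are ignored, and a gap opposite a residue counts as a mismatch.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    columns = [
        (x, y)
        for x, y in zip(a.upper(), b.upper())
        if not (x == "-" and y == "-")
    ]
    if not columns:
        raise ValueError("no comparable positions")
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


def homology_tier(pct: float) -> Optional[str]:
    """Return the most stringent tier satisfied by pct, or None."""
    for threshold, inclusive, label in TIERS:
        if pct > threshold or (inclusive and pct == threshold):
            return label
    return None


if __name__ == "__main__":
    pct = percent_identity("ACGTACGTAC", "ACGTACGTAA")
    print(pct, homology_tier(pct))
```

Because the 85% and 90% bounds are exclusive, a pair that is exactly 90% identical falls in the "shares significant homology" tier rather than the "preferred" tier.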

Region specific primers or probes derived from the nucleotide sequence
provided in SEQ ID NOS: 1-391 or from a nucleotide sequence at least 95%,
particularly at least 99%, especially at least 99.5% identical to a sequence of SEQ
ID NOS: 1-391 can be used to prime DNA synthesis and PCR amplification, as
well as to identify colonies containing cloned DNA encoding a homolog. Methods
suitable to this aspect of the present invention are well known and have been
described in great detail in many publications such as, for example, Innis et al.,
PCR Protocols, Academic Press, San Diego, CA (1990).

When using primers derived from SEQ ID NOS: 1-391 or from a nucleotide
sequence having an aforementioned identity to a sequence of SEQ ID NOS: 1-391,
one skilled in the art will recognize that by employing high stringency conditions
(e.g., annealing at 50-60°C in 6X SSPC and 50% formamide, and washing at
50-65°C in 0.5X SSPC) only sequences which are greater than 75% homologous to
the primer will be amplified. By employing lower stringency conditions (e.g.,
hybridizing at 35-37°C in 5X SSPC and 40-45% formamide, and washing at 42°C
in 0.5X SSPC), sequences which are greater than 40-50% homologous to the
primer will also be amplified.

When using DNA probes derived from SEQ ID NOS: 1-391, or from a
nucleotide sequence having an aforementioned identity to a sequence of SEQ ID
NOS: 1-391, for colony/plaque hybridization, one skilled in the art will recognize
that by employing high stringency conditions (e.g., hybridizing at 50-65°C in 5X
SSPC and 50% formamide, and washing at 50-65°C in 0.5X SSPC), sequences
having regions which are greater than 90% homologous to the probe can be
obtained, and that by employing lower stringency conditions (e.g., hybridizing at
35-37°C in 5X SSPC and 40-45% formamide, and washing at 42°C in 0.5X
SSPC), sequences having regions which are greater than 35-45% homologous to
the probe will be obtained.
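
For quick reference, the four sets of example conditions given above can be collected into a small lookup table. The sketch below is illustrative only: the class, field and function names are hypothetical, and where the text quotes a range for the homology floor (e.g., "greater than 40-50%") the helper conservatively compares against the upper end of that range, which is an assumption rather than something the text specifies.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Conditions:
    """Example conditions quoted in the text; ranges are (low, high)."""
    anneal_temp_c: Tuple[int, int]      # annealing / hybridization step
    anneal_sspc: str                    # buffer strength as written
    formamide_pct: Tuple[int, int]
    wash_temp_c: Tuple[int, int]
    wash_sspc: str
    homology_floor_pct: Tuple[int, int]


CONDITIONS: Dict[Tuple[str, str], Conditions] = {
    ("pcr", "high"): Conditions((50, 60), "6X", (50, 50), (50, 65), "0.5X", (75, 75)),
    ("pcr", "low"): Conditions((35, 37), "5X", (40, 45), (42, 42), "0.5X", (40, 50)),
    ("probe", "high"): Conditions((50, 65), "5X", (50, 50), (50, 65), "0.5X", (90, 90)),
    ("probe", "low"): Conditions((35, 37), "5X", (40, 45), (42, 42), "0.5X", (35, 45)),
}


def likely_recovered(method: str, stringency: str, homology_pct: float) -> bool:
    """True if homology_pct exceeds the quoted floor for these conditions.

    Assumption: when the floor is quoted as a range, the upper end is used.
    """
    _, floor_high = CONDITIONS[(method, stringency)].homology_floor_pct
    return homology_pct > floor_high


if __name__ == "__main__":
    for key in CONDITIONS:
        print(key, likely_recovered(key[0], key[1], 80.0))
```

Keeping the conditions in one table makes the contrast between primer-based amplification and probe-based hybridization easy to see: at high stringency the quoted floor is 75% for primers but 90% for probes.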

Any organism can be used as the source for homologs of the present
invention so long as the organism naturally expresses such a protein or contains
genes encoding the same. The most preferred organisms for isolating homologs are
bacteria which are closely related to Streptococcus pneumoniae.

ILLUSTRATIVE USES OF COMPOSITIONS OF THE INVENTION

Each ORF provided in Tables 1 and 2 is identified with a function by
homology to a known gene or polypeptide. As a result, one skilled in the art can
use the polypeptides of the present invention for commercial, therapeutic and
industrial purposes consistent with the type of putative identification of the
polypeptide. Such identifications permit one skilled in the art to use the
Streptococcus pneumoniae ORFs in a manner similar to the known type of
sequences for which the identification is made; for example, to ferment a particular
sugar source or to produce a particular metabolite. A variety of reviews illustrative
of this aspect of the invention are available, including the following reviews on the
industrial use of enzymes, for example, BIOCHEMICAL ENGINEERING AND
BIOTECHNOLOGY HANDBOOK, 2nd Ed., MacMillan Publications, Ltd., NY
(1991) and BIOCATALYSTS IN ORGANIC SYNTHESES, Tramper et al., Eds.,
Elsevier Science Publishers, Amsterdam, The Netherlands (1985). A variety of
exemplary uses that illustrate this and similar aspects of the present invention are
discussed below.

1. Biosynthetic Enzymes 

Open reading frames encoding proteins involved in mediating the catalytic
reactions involved in intermediary and macromolecular metabolism, the
biosynthesis of small molecules, cellular processes and other functions can be
used for industrial biosynthesis. These include enzymes involved in the
degradation of the intermediary products of metabolism, enzymes involved in
central intermediary metabolism, enzymes involved in respiration, both aerobic
and anaerobic, enzymes involved in fermentation, enzymes involved in ATP proton
motive force conversion, enzymes involved in broad regulatory function, enzymes
involved in amino acid synthesis, enzymes involved in nucleotide synthesis, and
enzymes involved in cofactor and vitamin synthesis.

The various metabolic pathways present in Streptococcus pneumoniae can
be identified based on absolute nutritional requirements as well as by examining the
various enzymes identified in Tables 1-3 and SEQ ID NOS: 1-391.

Of particular interest are polypeptides involved in the degradation of 
intermediary metabolites as well as non-macromolecular metabolism. Such 
enzymes include amylases, glucose oxidases, and catalase. 

Proteolytic enzymes are another class of commercially important enzymes.
Proteolytic enzymes find use in a number of industrial processes including the
processing of flax and other vegetable fibers, in the extraction, clarification and
depectinization of fruit juices, in the extraction of vegetable oil and in the
maceration of fruits and vegetables to give unicellular fruits. A detailed review of
the proteolytic enzymes used in the food industry is provided in Rombouts et al.,
Symbiosis 21:19 (1986) and Voragen et al., in Biocatalysts In Agricultural
Biotechnology, Whitaker et al., Eds., American Chemical Society Symposium
Series 389:93 (1989).

The metabolism of sugars is an important aspect of the primary metabolism
of Streptococcus pneumoniae. Enzymes involved in the degradation of sugars,
such as, particularly, glucose, galactose, fructose and xylose, can be used in
industrial fermentation. Some of the important sugar transforming enzymes, from
a commercial viewpoint, include sugar isomerases such as glucose isomerase.
Other metabolic enzymes have found commercial use, such as glucose oxidases
which produce ketogulonic acid (KGA). KGA is an intermediate in the
commercial production of ascorbic acid using the Reichstein's procedure, as
described in Krueger et al., Biotechnology 6(A), Rhine et al., Eds., Verlag Press,
Weinheim, Germany (1984).

Glucose oxidase (GOD) is commercially available and has been used in
purified form as well as in an immobilized form for the deoxygenation of beer.
See, for instance, Hartmeir et al., Biotechnology Letters 1:21 (1979). The most
important application of GOD is the industrial scale fermentation of gluconic acid.
There is a market for gluconic acids, which are used in the detergent, textile,
leather, photographic, pharmaceutical, food, feed and concrete industries, as
described, for example, in Bigelis et al., beginning on page 357 in GENE
MANIPULATIONS AND FUNGI, Bennett et al., Eds., Academic Press, New York
(1985). In addition to industrial applications, GOD has found applications in
medicine for quantitative determination of glucose in body fluids and, recently, in
biotechnology for analyzing syrups from starch and cellulose hydrolysates. This
application is described in Owusu et al., Biochem. et Biophysica. Acta. 572:83
(1986), for instance.

The main sweetener used in the world today is sugar which comes from
sugar beets and sugar cane. In the field of industrial enzymes, the glucose
isomerase process shows the largest expansion in the market today. Initially,
soluble enzymes were used and later immobilized enzymes were developed
(Krueger et al., Biotechnology, The Textbook of Industrial Microbiology, Sinauer
Associated Incorporated, Sunderland, Massachusetts (1990)). Today, the use of
glucose-produced high fructose syrups is by far the largest industrial business
using immobilized enzymes. A review of the industrial use of these enzymes is
provided by Jorgensen, Starch 40:307 (1988).

Proteinases, such as alkaline serine proteinases, are used as detergent
additives and thus represent one of the largest volumes of microbial enzymes used
in the industrial sector. Because of their industrial importance, there is a large body
of published and unpublished information regarding the use of these enzymes in
industrial processes. (See Faultman et al., Acid Proteases Structure Function and
Biology, Tang, J., ed., Plenum Press, New York (1977) and Godfrey et al.,
Industrial Enzymes, MacMillan Publishers, Surrey, UK (1983) and Hepner et al.,
Report Industrial Enzymes by 1990, Hel Hepner & Associates, London (1986)).

Another class of commercially usable proteins of the present invention are
the microbial lipases, described by, for instance, Macrae et al., Philosophical
Transactions of the Royal Society of London 310:221 (1985) and Poserke, Journal
of the American Oil Chemist Society 67:1758 (1984). A major use of lipases is in
the fat and oil industry for the production of neutral glycerides using lipase
catalyzed inter-esterification of readily available triglycerides. Applications of
lipases include use as a detergent additive to facilitate the removal of fats from
fabrics in the course of the washing procedures.

The use of enzymes, and in particular microbial enzymes, as catalysts for
key steps in the synthesis of complex organic molecules is gaining popularity at a
great rate. One area of great interest is the preparation of chiral intermediates.
Preparation of chiral intermediates is of interest to a wide range of synthetic
chemists, particularly those scientists involved with the preparation of new
pharmaceuticals, agrochemicals, fragrances and flavors. (See Davies et al., Recent
Advances in the Generation of Chiral Intermediates Using Enzymes, CRC Press,
Boca Raton, Florida (1990)). The following reactions catalyzed by enzymes are of
interest to organic chemists: hydrolysis of carboxylic acid esters, phosphate esters,
amides and nitriles, esterification reactions, trans-esterification reactions, synthesis
of amides, reduction of alkanones and oxoalkanoates, oxidation of alcohols to
carbonyl compounds, oxidation of sulfides to sulfoxides, and carbon bond forming
reactions such as the aldol reaction.

When considering the use of an enzyme encoded by one of the ORFs of the
present invention for biotransformation and organic synthesis, it is sometimes
necessary to consider the respective advantages and disadvantages of using a
microorganism as opposed to an isolated enzyme. The pros and cons of using a
whole cell system on the one hand or an isolated, partially purified enzyme on the
other hand have been described in detail by Bud et al., Chemistry in Britain
(1987), p. 127.

Amino transferases, enzymes involved in the biosynthesis and metabolism
of amino acids, are useful in the catalytic production of amino acids. The
advantage of using microbial based enzyme systems is that the amino transferase
enzymes catalyze the stereoselective synthesis of only L-amino acids and
generally possess uniformly high catalytic rates. A description of the use of amino
transferases for amino acid production is provided by Roselle-David, Methods of
Enzymology 136:479 (1987).

Another category of useful proteins encoded by the ORFs of the present
invention includes enzymes involved in nucleic acid synthesis, repair, and
recombination.



2. Generation of Antibodies 

As described here, the proteins of the present invention, as well as
homologs thereof, can be used in a variety of procedures and methods known in
the art which are currently applied to other proteins. The proteins of the present
invention can further be used to generate an antibody which selectively binds the
protein. Such antibodies can be either monoclonal or polyclonal antibodies, as well
as fragments of these antibodies, and humanized forms.

The invention further provides antibodies which selectively bind to one of
the proteins of the present invention and hybridomas which produce these
antibodies. A hybridoma is an immortalized cell line which is capable of secreting
a specific monoclonal antibody.

In general, techniques for preparing polyclonal and monoclonal antibodies
as well as hybridomas capable of producing the desired antibody are well known in
the art (Campbell, A. M., Monoclonal Antibody Technology: Laboratory
Techniques In Biochemistry And Molecular Biology, Elsevier Science Publishers,
Amsterdam, The Netherlands (1984); St. Groth et al., J. Immunol. Methods
35:1-21 (1980); Kohler and Milstein, Nature 256:495-497 (1975)) and include the
trioma technique and the human B-cell hybridoma technique (Kozbor et al.,
Immunology Today 4:12 (1983); pgs. 77-96 of Cole et al., in Monoclonal
Antibodies And Cancer Therapy, Alan R. Liss, Inc. (1985)). Any animal (mouse,
rabbit, etc.) which is known to produce antibodies can be immunized with the
pseudogene polypeptide. Methods for immunization are well known in the art. Such
methods include subcutaneous or intraperitoneal injection of the polypeptide. One
skilled in the art will recognize that the amount of the protein encoded by the ORF
of the present invention used for immunization will vary based on the animal which
is immunized, the antigenicity of the peptide and the site of injection.

The protein which is used as an immunogen may be modified or
administered in an adjuvant in order to increase the protein's antigenicity. Methods
of increasing the antigenicity of a protein are well known in the art and include, but
are not limited to, coupling the antigen with a heterologous protein (such as globulin
or galactosidase) or the inclusion of an adjuvant during immunization.

For monoclonal antibodies, spleen cells from the immunized animals are
removed, fused with myeloma cells, such as SP2/0-Ag14 myeloma cells, and
allowed to become monoclonal antibody producing hybridoma cells.

Any one of a number of methods well known in the art can be used to
identify the hybridoma cell which produces an antibody with the desired
characteristics. These include screening the hybridomas with an ELISA assay,
western blot analysis, or radioimmunoassay (Lutz et al., Exp. Cell Res.
175:109-124 (1988)).

Hybridomas secreting the desired antibodies are cloned and the class and
subclass are determined using procedures known in the art (Campbell, A. M.,
Monoclonal Antibody Technology: Laboratory Techniques in Biochemistry and
Molecular Biology, Elsevier Science Publishers, Amsterdam, The Netherlands
(1984)).

Techniques described for the production of single chain antibodies (U. S.
Patent 4,946,778) can be adapted to produce single chain antibodies to proteins of
the present invention.

For polyclonal antibodies, antibody-containing antiserum is isolated from the
immunized animal and is screened for the presence of antibodies with the desired
specificity using one of the above-described procedures.

The present invention further provides the above-described antibodies in
detectably labelled form. Antibodies can be detectably labelled through the use of
radioisotopes, affinity labels (such as biotin, avidin, etc.), enzymatic labels (such
as horseradish peroxidase, alkaline phosphatase, etc.), fluorescent labels (such as
FITC or rhodamine, etc.), paramagnetic atoms, etc. Procedures for accomplishing
such labeling are well-known in the art; for example, see Sternberger et al., J.
Histochem. Cytochem. 75:315 (1970); Bayer, E. A. et al., Meth. Enzym. 62:308
(1979); Engval, E. et al., Immunol. 109:129 (1972); Goding, J. W., J. Immunol.
Meth. 13:215 (1976).

The labeled antibodies of the present invention can be used for in vitro, in
vivo, and in situ assays to identify cells or tissues in which a fragment of the
Streptococcus pneumoniae genome is expressed.

The present invention further provides the above-described antibodies
immobilized on a solid support. Examples of such solid supports include plastics
such as polycarbonate, complex carbohydrates such as agarose and sepharose,
and acrylic resins such as polyacrylamide and latex beads. Techniques for
coupling antibodies to such solid supports are well known in the art (Weir, D. M.
et al., "Handbook of Experimental Immunology" 4th Ed., Blackwell Scientific
Publications, Oxford, England, Chapter 10 (1986); Jacoby, W. D. et al., Meth.
Enzym. 34, Academic Press, N. Y. (1974)). The immobilized antibodies of the
present invention can be used for in vitro, in vivo, and in situ assays as well as for
immunoaffinity purification of the proteins of the present invention.

3. Diagnostic Assays and Kits 

The present invention further provides methods to identify the expression
of one of the ORFs of the present invention, or homolog thereof, in a test sample,
using one of the DFs or antibodies of the present invention.

In detail, such methods comprise incubating a test sample with one or more
of the antibodies or one or more of the DFs of the present invention and assaying
for binding of the DFs or antibodies to components within the test sample.

Conditions for incubating a DF or antibody with a test sample vary.
Incubation conditions depend on the format employed in the assay, the detection
methods employed, and the type and nature of the DF or antibody used in the
assay. One skilled in the art will recognize that any one of the commonly available
hybridization, amplification or immunological assay formats can readily be adapted
to employ the DFs or antibodies of the present invention. Examples of such assays
can be found in Chard, T., An Introduction to Radioimmunoassay and Related
Techniques, Elsevier Science Publishers, Amsterdam, The Netherlands (1986);
Bullock, G. R. et al., Techniques in Immunocytochemistry, Academic Press,
Orlando, FL Vol. 1 (1982), Vol. 2 (1983), Vol. 3 (1985); Tijssen, P., Practice and
Theory of Enzyme Immunoassays: Laboratory Techniques in Biochemistry and
Molecular Biology, Elsevier Science Publishers, Amsterdam, The Netherlands
(1985).

The test samples of the present invention include cells, protein or membrane
extracts of cells, or biological fluids such as sputum, blood, serum, plasma, or
urine. The test sample used in the above-described method will vary based on the
assay format, nature of the detection method and the tissues, cells or extracts used
as the sample to be assayed. Methods for preparing protein extracts or membrane
extracts of cells are well known in the art and can readily be adapted in order to
obtain a sample which is compatible with the system utilized.

In another embodiment of the present invention, kits are provided which
contain the necessary reagents to carry out the assays of the present invention.

Specifically, the invention provides a compartmentalized kit to receive, in
close confinement, one or more containers, which comprises: (a) a first container
comprising one of the DFs or antibodies of the present invention; and (b) one or
more other containers comprising one or more of the following: wash reagents,
reagents capable of detecting presence of a bound DF or antibody.

In detail, a compartmentalized kit includes any kit in which reagents are
contained in separate containers. Such containers include small glass containers,
plastic containers or strips of plastic or paper. Such containers allow one to
efficiently transfer reagents from one compartment to another compartment such
that the samples and reagents are not cross-contaminated, and the agents or
solutions of each container can be added in a quantitative fashion from one
compartment to another. Such containers will include a container which will accept
the test sample, a container which contains the antibodies used in the assay,
containers which contain wash reagents (such as phosphate buffered saline,
Tris-buffers, etc.), and containers which contain the reagents used to detect the
bound antibody or DF.

Types of detection reagents include labelled nucleic acid probes, labelled
secondary antibodies, or in the alternative, if the primary antibody is labelled, the
enzymatic, or antibody binding reagents which are capable of reacting with the
labelled antibody. One skilled in the art will readily recognize that the disclosed
DFs and antibodies of the present invention can be readily incorporated into one of
the established kit formats which are well known in the art.

4. Screening Assay for Binding Agents

Using the isolated proteins of the present invention, the present invention
further provides methods of obtaining and identifying agents which bind to a
protein encoded by one of the ORFs of the present invention or to one of the
fragments and the Streptococcus pneumoniae fragment and contigs herein
described.

In general, such methods comprise steps of: 

(a) contacting an agent with an isolated protein encoded by one of the 
ORFs of the present invention, or an isolated fragment of the Streptococcus 
pneumoniae genome; and 
(b) determining whether the agent binds to said protein or said fragment.

The agents screened in the above assay can be, but are not limited to,
peptides, carbohydrates, vitamin derivatives, or other pharmaceutical agents. The
agents can be selected and screened at random or rationally selected or designed
using protein modeling techniques.

For random screening, agents such as peptides, carbohydrates,
pharmaceutical agents and the like are selected at random and are assayed for their
ability to bind to the protein encoded by the ORF of the present invention.

Alternatively, agents may be rationally selected or designed. As used
herein, an agent is said to be "rationally selected or designed" when the agent is
chosen based on the configuration of the particular protein. For example, one
skilled in the art can readily adapt currently available procedures to generate
peptides, pharmaceutical agents and the like capable of binding to a specific peptide
sequence in order to generate rationally designed antipeptide peptides, for example
see Hurby et al., "Application of Synthetic Peptides: Antisense Peptides," in
Synthetic Peptides, A User's Guide, W. H. Freeman, NY (1992), pp. 289-307,
and Kaspczak et al., Biochemistry 28:9230-8 (1989), or pharmaceutical agents, or
the like.

In addition to the foregoing, one class of agents of the present invention, as
broadly described, can be used to control gene expression through binding to one
of the ORFs or EMFs of the present invention. As described above, such agents
can be randomly screened or rationally designed/selected. Targeting the ORF or
EMF allows a skilled artisan to design sequence specific or element specific agents,
modulating the expression of either a single ORF or multiple ORFs which rely on
the same EMF for expression control.

One class of DNA binding agents are agents which contain base residues
which hybridize or form a triple helix by binding to DNA or RNA. Such agents
can be based on the classic phosphodiester, ribonucleic acid backbone, or can be a
variety of sulfhydryl or polymeric derivatives which have base attachment capacity.

Agents suitable for use in these methods usually contain 20 to 40 bases and
are designed to be complementary to a region of the gene involved in transcription
(triple helix - see Lee et al., Nucl. Acids Res. 6:3073 (1979); Cooney et al.,
Science 241:456 (1988); and Dervan et al., Science 257:1360 (1991)) or to the
mRNA itself (antisense - Okano, J. Neurochem. 55:560 (1991);
Oligodeoxynucleotides as Antisense Inhibitors of Gene Expression, CRC Press,
Boca Raton, FL (1988)). Triple helix-formation optimally results in a shut-off of
RNA transcription from DNA, while antisense RNA hybridization blocks
translation of an mRNA molecule into polypeptide. Both techniques have been
demonstrated to be effective in model systems. Information contained in the
sequences of the present invention can be used to design antisense and triple
helix-forming oligonucleotides, and other DNA binding agents.
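
As a minimal sketch of the antisense case described above, the following Python function derives a DNA oligonucleotide complementary to a chosen window of an mRNA sequence. The function name, the zero-based `start` coordinate and the example sequence are hypothetical; the only constraints taken from the text are that the agent is complementary to the mRNA and contains 20 to 40 bases. Practical design would also weigh factors the text does not discuss (target accessibility, backbone chemistry and so on), which are outside this sketch.

```python
# Watson-Crick partner of each base, written as DNA (U and T both pair with A).
_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "U": "A", "T": "A"}


def antisense_oligo(mrna: str, start: int, length: int = 20) -> str:
    """Return a DNA oligo (5'->3') complementary to mrna[start:start+length].

    Assumptions: mrna is given 5'->3', start is zero-based, and length is
    restricted to the 20-40 base range quoted in the text.
    """
    if not 20 <= length <= 40:
        raise ValueError("length must be between 20 and 40 bases")
    if start < 0:
        raise ValueError("start must be non-negative")
    target = mrna.upper()[start:start + length]
    if len(target) != length:
        raise ValueError("target window runs past the end of the sequence")
    try:
        # Reverse the window, then complement each base.
        return "".join(_COMPLEMENT[base] for base in reversed(target))
    except KeyError as exc:
        raise ValueError(f"unexpected base {exc} in sequence") from None


if __name__ == "__main__":
    # Hypothetical 27-nt mRNA fragment, for illustration only.
    example = "AUGGCUAGCUAGGCUUACGAUCGAUCG"
    print(antisense_oligo(example, 0, 20))
```

Reversing before complementing yields the oligo in the conventional 5' to 3' orientation, so that it pairs antiparallel with the target window.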

5. Pharmaceutical Compositions and Vaccines

The present invention further provides pharmaceutical agents which can be
used to modulate the growth or pathogenicity of Streptococcus pneumoniae, or
another related organism, in vivo or in vitro. As used herein, a "pharmaceutical
agent" is defined as a composition of matter which can be formulated using known
techniques to provide a pharmaceutical composition. As used herein, the
"pharmaceutical agents of the present invention" refers to the pharmaceutical agents
which are derived from the proteins encoded by the ORFs of the present invention
or are agents which are identified using the herein described assays.

As used herein, a pharmaceutical agent is said to "modulate the growth or
pathogenicity of Streptococcus pneumoniae or a related organism, in vivo or in
vitro" when the agent reduces the rate of growth, rate of division, or viability of
the organism in question. The pharmaceutical agents of the present invention can
modulate the growth or pathogenicity of an organism in many fashions, although
an understanding of the underlying mechanism of action is not needed to practice
the use of the pharmaceutical agents of the present invention. Some agents will
modulate the growth by binding to an important protein, thus blocking the biological
activity of the protein, while other agents may bind to a component of the outer
surface of the organism, blocking attachment or rendering the organism more prone
to action by the body's natural immune system. Alternatively, the agent may
comprise a protein encoded by one of the ORFs of the present invention and serve
as a vaccine. The development and use of a vaccine based on outer membrane
components are well known in the art.

As used herein, a "related organism" is a broad term which refers to any 
organism whose growth can be modulated by one of the pharmaceutical agents of 
the present invention. In general, such an organism will contain a homolog of the 
protein which is the target of the pharmaceutical agent or the protein used as a 
vaccine. As such, related organisms do not need to be bacterial but may be fungal 
or viral pathogens. 

The pharmaceutical agents and compositions of the present invention may
be administered in a convenient manner, such as by the oral, topical, intravenous,
intraperitoneal, intramuscular, subcutaneous, intranasal or intradermal routes. The
pharmaceutical compositions are administered in an amount which is effective for
treating and/or prophylaxis of the specific indication. In general, they are
administered in an amount of at least about 1 mg/kg body weight and in most cases
they will be administered in an amount not in excess of about 1 g/kg body weight
per day. In most cases, the dosage is from about 0.1 mg/kg to about 10 g/kg body
weight daily, taking into account the routes of administration, symptoms, etc.

The agents of the present invention can be used in native form or can be
modified to form a chemical derivative. As used herein, a molecule is said to be a
"chemical derivative" of another molecule when it contains additional chemical
moieties not normally a part of the molecule. Such moieties may improve the
molecule's solubility, absorption, biological half-life, etc. The moieties may
alternatively decrease the toxicity of the molecule, eliminate or attenuate any
undesirable side effect of the molecule, etc. Moieties capable of mediating such
effects are disclosed in, among other sources, REMINGTON'S
PHARMACEUTICAL SCIENCES (1980) cited elsewhere herein.

For example, such moieties may change an immunological character of the 
functional derivative, such as affinity for a given antibody. Such changes in 
immunomodulation activity are measured by the appropriate assay, such as a 
competitive type immunoassay. Modifications of such protein properties as redox
or thermal stability, biological half-life, hydrophobicity, susceptibility to proteolytic
degradation or the tendency to aggregate with carriers or into multimers also may
be effected in this way and can be assayed by methods well known to the skilled
artisan.

The therapeutic effects of the agents of the present invention may be
obtained by providing the agent to a patient by any suitable means (e.g., inhalation,
intravenously, intramuscularly, subcutaneously, enterally, or parenterally). It is
preferred to administer the agent of the present invention so as to achieve an
effective concentration within the blood or tissue in which the growth of the
organism is to be controlled. To achieve an effective blood concentration, the
preferred method is to administer the agent by injection. The administration may be
by continuous infusion, or by single or multiple injections.

In providing a patient with one of the agents of the present invention, the 
dosage of the administered agent will vary depending upon such factors as the 
patient's age, weight, height, sex, general medical condition, previous medical 
history, etc. In general, it is desirable to provide the recipient with a dosage of
agent which is in the range of from about 1 pg/kg to 10 mg/kg (body weight of
patient), although a lower or higher dosage may be administered. The 
therapeutically effective dose can be lowered by using combinations of the agents 
of the present invention or another agent. 

As used herein, two or more compounds or agents are said to be
administered "in combination" with each other when either (1) the physiological
effects of each compound, or (2) the serum concentrations of each compound can 
be measured at the same time. The composition of the present invention can be 
administered concurrently with, prior to, or following the administration of the 
other agent. 

The agents of the present invention are intended to be provided to recipient
subjects in an amount sufficient to decrease the rate of growth (as defined above) of
the target organism. 

The administration of the agent(s) of the invention may be for either a
"prophylactic" or "therapeutic" purpose. When provided prophylactically, the
agent(s) are provided in advance of any symptoms indicative of the organism's
growth. The prophylactic administration of the agent(s) serves to prevent,
attenuate, or decrease the rate of onset of any subsequent infection. When
provided therapeutically, the agent(s) are provided at (or shortly after) the onset of
an indication of infection. The therapeutic administration of the compound(s)
serves to attenuate the pathological symptoms of the infection and to increase the
rate of recovery.

The agents of the present invention are administered to a subject, such as a
mammal, or a patient, in a pharmaceutically acceptable form and in a therapeutically
effective concentration. A composition is said to be "pharmacologically acceptable"
if its administration can be tolerated by a recipient patient. Such an agent is said to
be administered in a "therapeutically effective amount" if the amount administered
is physiologically significant. An agent is physiologically significant if its presence
results in a detectable change in the physiology of a recipient patient.

The agents of the present invention can be formulated according to known
methods to prepare pharmaceutically useful compositions, whereby these materials,
or their functional derivatives, are combined in a mixture with a pharmaceutically
acceptable carrier vehicle. Suitable vehicles and their formulation, inclusive of
other human proteins, e.g., human serum albumin, are described, for example, in
REMINGTON'S PHARMACEUTICAL SCIENCES, 16th Ed., Osol, A., Ed.,
Mack Publishing, Easton PA (1980). In order to form a pharmaceutically
acceptable composition suitable for effective administration, such compositions will
contain an effective amount of one or more of the agents of the present invention,
together with a suitable amount of carrier vehicle.

Additional pharmaceutical methods may be employed to control the duration
of action. Controlled release preparations may be achieved through the use of
polymers to complex or absorb one or more of the agents of the present invention.
The controlled delivery may be effectuated by a variety of well known techniques,
including formulation with macromolecules such as, for example, polyesters,
polyamino acids, polyvinyl pyrrolidone, ethylenevinylacetate, methylcellulose,
carboxymethylcellulose, or protamine sulfate, adjusting the concentration of the
macromolecules and the agent in the formulation, and by appropriate use of
methods of incorporation, which can be manipulated to effectuate a desired time
course of release. Another possible method to control the duration of action by
controlled release preparations is to incorporate agents of the present invention into
particles of a polymeric material such as polyesters, polyamino acids, hydrogels,
poly(lactic acid) or ethylene vinylacetate copolymers. Alternatively, instead of
incorporating these agents into polymeric particles, it is possible to entrap these
materials in microcapsules prepared, for example, by coacervation techniques or by
interfacial polymerization with, for example, hydroxymethylcellulose or gelatine-
microcapsules and poly(methylmethacrylate) microcapsules, respectively, or in
colloidal drug delivery systems, for example, liposomes, albumin microspheres,
microemulsions, nanoparticles, and nanocapsules or in macroemulsions. Such
techniques are disclosed in REMINGTON'S PHARMACEUTICAL SCIENCES
(1980).

The invention further provides a pharmaceutical pack or kit comprising one 
or more containers filled with one or more of the ingredients of the pharmaceutical 
compositions of the invention. Associated with such container(s) can be a notice in 
the form prescribed by a governmental agency regulating the manufacture, use or 
sale of pharmaceuticals or biological products, which notice reflects approval by
the agency of manufacture, use or sale for human administration. 

In addition, the agents of the present invention may be employed in 
conjunction with other therapeutic compounds. 

6. Shot-Gun Approach to Megabase DNA Sequencing

The present invention further demonstrates that a large sequence can be 
sequenced using a random shotgun approach. This procedure, described in detail 
in the examples that follow, has eliminated the up-front cost of isolating and
ordering overlapping or contiguous subclones prior to the start of the sequencing
protocols.

Certain aspects of the present invention are described in greater detail in the 
examples that follow. The examples are provided by way of illustration. Other 
aspects and embodiments of the present invention are contemplated by the 
inventors, as will be clear to those of skill in the art from reading the present
disclosure.

ILLUSTRATIVE EXAMPLES 

LIBRARIES AND SEQUENCING 
1. Shotgun Sequencing Probability Analysis

The overall strategy for a shotgun approach to whole genome sequencing
follows from the Lander and Waterman (Lander and Waterman, Genomics
2:231 (1988)) application of the equation for the Poisson distribution. According
to this treatment, the probability, $P_0$, that any given base in a sequence of size L, in
nucleotides, is not sequenced after a certain amount, n, in nucleotides, of random
sequence has been determined can be calculated by the equation $P_0 = e^{-m}$, where m
is n/L, the fold coverage. For instance, for a genome of 2.8 Mb, m = 1 when 2.8
Mb of sequence has been randomly generated (1X coverage). At that point, $P_0 =
e^{-1} = 0.37$. The probability that any given base has not been sequenced is the same
as the probability that any region of the whole sequence L has not been determined
and, therefore, is equivalent to the fraction of the whole sequence that has yet to be
determined. Thus, at one-fold coverage, approximately 37% of a polynucleotide of
size L, in nucleotides, has not been sequenced. When 14 Mb of sequence has been
generated, coverage is 5X for a 2.8 Mb sequence and the unsequenced fraction drops to
0.0067, or 0.67%. 5X coverage of a 2.8 Mb sequence can be attained by sequencing
approximately 17,000 random clones from both insert ends with an average
sequence read length of 410 bp.

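Similarly, the total gap length, G, is determined by the equation $G = Le^{-m}$,
and the average gap size, g, follows the equation $g = L/n$, where n here denotes
the number of random sequence reads (about 34,000 in the example above) rather
than nucleotides. Thus, 5X coverage leaves about 240 gaps averaging about 82 bp
in size in a sequence of a polynucleotide 2.8 Mb long.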

The treatment above is essentially that of Lander and Waterman, Genomics 
2: 231 (1988). 
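
The estimates above can be reproduced with a few lines of arithmetic. The following Python sketch is illustrative only and is not part of the described procedure; the function name and the returned field names are assumptions introduced here, and the number of gaps is taken simply as the total gap length divided by the average gap size.

```python
import math


def shotgun_stats(genome_len, n_reads, read_len):
    """Poisson estimates for random shotgun sequencing (illustrative sketch).

    genome_len: size L of the target sequence, in nucleotides
    n_reads:    number of random sequence reads
    read_len:   average usable read length, in nucleotides
    """
    m = n_reads * read_len / genome_len   # fold coverage (total bases read / L)
    p0 = math.exp(-m)                     # fraction of bases not yet sequenced
    total_gap = genome_len * p0           # G = L * e^(-m)
    avg_gap = genome_len / n_reads        # g = L / (number of reads)
    n_gaps = total_gap / avg_gap          # expected number of gaps
    return {
        "coverage": m,
        "unsequenced": p0,
        "total_gap": total_gap,
        "avg_gap": avg_gap,
        "n_gaps": n_gaps,
    }


# 17,000 clones sequenced from both ends, 410 bp average read, 2.8 Mb target
stats = shotgun_stats(2_800_000, 2 * 17_000, 410)
for name, value in stats.items():
    print(f"{name}: {value:.4g}")
```

With these inputs the sketch gives a coverage just under 5X, an average gap of roughly 82 bp, and a gap count in the low hundreds, in line with the figures quoted above.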

2. Random Library Construction

In order to approximate the random model described above during actual 
sequencing, a nearly ideal library of cloned genomic fragments is required. The 
following library construction procedure was developed to achieve this end. 

Streptococcus pneumoniae DNA is prepared by phenol extraction. A
mixture containing 200 ng DNA in 1.0 ml of 300 mM sodium acetate, 10 mM Tris-
HCl, 1 mM Na-EDTA, 50% glycerol is processed through a nebulizer (IPI Medical
Products) with a stream of nitrogen adjusted to 35 kPa for 2 minutes. The
sonicated DNA is ethanol precipitated and redissolved in 500 µl TE buffer.

To create blunt-ends, a 100 µl aliquot of the resuspended DNA is digested
with 5 units of BAL31 nuclease (New England BioLabs) for 10 min at 30°C in 200
µl BAL31 buffer. The digested DNA is phenol-extracted, ethanol-precipitated,
redissolved in 100 µl TE buffer, and then size-fractionated by electrophoresis
through a 1.0% low melting temperature agarose gel. The section containing DNA
fragments 1.6-2.0 kb in size is excised from the gel, and the LGT agarose is melted
and the resulting solution is extracted with phenol to separate the agarose from the
DNA. DNA is ethanol precipitated and redissolved in 20 µl of TE buffer for
ligation to vector.

A two-step ligation procedure is used to produce a plasmid library with
97% inserts, of which >99% were single inserts. The first ligation mixture (50 µl)
contains 2 µg of DNA fragments, 2 µg pUC18 DNA (Pharmacia) cut with SmaI
and dephosphorylated with bacterial alkaline phosphatase, and 10 units of T4 ligase
(GIBCO/BRL) and is incubated at 14°C for 4 hr. The ligation mixture then is
phenol extracted and ethanol precipitated, and the precipitated DNA is dissolved in
20 µl TE buffer and electrophoresed on a 1.0% low melting agarose gel. Discrete
bands in a ladder are visualized by ethidium bromide-staining and UV illumination
and identified by size as insert (I), vector (v), v+I, v+2I, v+3I, etc. The portion of
the gel containing v+I DNA is excised and the v+I DNA is recovered and
resuspended into 20 µl TE. The v+I DNA then is blunt-ended by T4 polymerase
treatment for 5 min at 37°C in a reaction mixture (50 µl) containing the v+I linears,
500 µM each of the 4 dNTPs, and 9 units of T4 polymerase (New England
BioLabs), under recommended buffer conditions. After phenol extraction and
ethanol precipitation the repaired v+I linears are dissolved in 20 µl TE. The final
ligation to produce circles is carried out in a 50 µl reaction containing 5 µl of v+I
linears and 5 units of T4 ligase at 14°C overnight. After 10 min at 70°C the
following day, the reaction mixture is stored at -20°C.

This two-stage procedure results in a molecularly random collection of
single-insert plasmid recombinants with minimal contamination from double-insert
chimeras (<1%) or free vector (<3%).

Since deviation from randomness can arise from propagation of the DNA in
the host, E. coli host cells deficient in all recombination and restriction functions
(A. Greener, Strategies 3 (1):5 (1990)) are used to prevent rearrangements, 
deletions, and loss of clones by restriction. Furthermore, transformed cells are 
plated directly on antibiotic diffusion plates to avoid the usual broth recovery phase 
which allows multiplication and selection of the most rapidly growing cells. 

Plating is carried out as follows. A 100 µl aliquot of Epicurian Coli SURE
II Supercompetent Cells (Stratagene 200152) is thawed on ice and transferred to a
chilled Falcon 2059 tube on ice. A 1.7 µl aliquot of 1.42 M beta-mercaptoethanol
is added to the aliquot of cells to a final concentration of 25 mM. Cells are
incubated on ice for 10 min. A 1 µl aliquot of the final ligation is added to the cells
and incubated on ice for 30 min. The cells are heat pulsed for 30 sec at 42°C and
placed back on ice for 2 min. The outgrowth period in liquid culture is eliminated
from this protocol in order to minimize the preferential growth of any given
transformed cell. Instead the transformation mixture is plated directly on a nutrient
rich SOB plate containing a 5 ml bottom layer of SOB agar (5% SOB agar: 20 g
tryptone, 5 g yeast extract, 0.5 g NaCl, 1.5% Difco Agar per liter of media). The 5
ml bottom layer is supplemented with 0.4 ml of 50 mg/ml ampicillin per 100 ml
SOB agar. The 15 ml top layer of SOB agar is supplemented with 1 ml X-Gal
(2%), 1 ml MgCl2 (1 M), and 1 ml MgSO4 per 100 ml SOB agar. The 15 ml top layer
is poured just prior to plating. Our titer is approximately 100 colonies per 10 µl aliquot
of transformation.

All colonies are picked for template preparation regardless of size. Thus,
only clones lost due to "poison" DNA or deleterious gene products are deleted from
the library, resulting in a slight increase in gap number over that expected.

3. Random DNA Sequencing

High quality double stranded DNA plasmid templates are prepared using a
"boiling bead" method developed in collaboration with Advanced Genetic
Technology Corp. (Gaithersburg, MD) (Adams et al., Science 252:1651 (1991);
Adams et al., Nature 355:622 (1992)). Plasmid preparation is performed in a
96-well format for all stages of DNA preparation from bacterial growth through final
DNA purification. Template concentration is determined using Hoechst Dye and a
Millipore Cytofluor. DNA concentrations are not adjusted, but low-yielding
templates are identified where possible and not sequenced.

Templates are also prepared from two Streptococcus pneumoniae lambda
genomic libraries. An amplified library is constructed in the vector Lambda
GEM-12 (Promega) and an unamplified library is constructed in Lambda DASH II
(Stratagene). In particular, for the unamplified lambda library, Streptococcus
pneumoniae DNA (> 100 kb) is partially digested in a reaction mixture (200 µl)
containing 50 µg DNA, 1X Sau3AI buffer, and 20 units Sau3AI for 6 min at 23°C.
The digested DNA is phenol-extracted and electrophoresed on a 0.5% low
melting agarose gel at 2 V/cm for 7 hours. Fragments from 15 to 25 kb are excised
and recovered in a final volume of 6 µl. One µl of fragments is used with 1 µl of
DASH II vector (Stratagene) in the recommended ligation reaction. One µl of the
ligation mixture is used per packaging reaction following the recommended
protocol with the Gigapack II XL Packaging Extract (Stratagene, #227711). Phage
are plated directly without amplification from the packaging mixture (after dilution
with 500 µl of recommended SM buffer and chloroform treatment). Yield is about
2.5x10^ pfu/µl. The amplified library is prepared essentially as above except the
lambda GEM-12 vector is used. After packaging, about 3.5x10^ pfu are plated on
the restrictive NM539 host. The lysate is harvested in 2 ml of SM buffer and
stored frozen in 7% dimethylsulfoxide. The phage titer is approximately 1x10^
pfu/ml.

Liquid lysates (100 µl) are prepared from randomly selected plaques (from
the unamplified library) and template is prepared by long-range PCR using T7 and
T3 vector-specific primers.

Sequencing reactions are carried out on plasmid and/or PCR templates
using the AB Catalyst LabStation with Applied Biosystems PRISM Ready
Reaction Dye Primer Cycle Sequencing Kits for the M13 forward (M13-21) and
the M13 reverse (M13RP1) primers (Adams et al., Nature 368:474 (1994)). Dye
terminator sequencing reactions are carried out on the lambda templates on a
Perkin-Elmer 9600 Thermocycler using the Applied Biosystems Ready Reaction
Dye Terminator Cycle Sequencing kits. T7 and SP6 primers are used to sequence
the ends of the inserts from the Lambda GEM-12 library and T7 and T3 primers are
used to sequence the ends of the inserts from the Lambda DASH II library.

Sequencing reactions are performed by eight individuals using an average of
fourteen AB 373 DNA Sequencers per day. All sequencing reactions are analyzed
using the Stretch modification of the AB 373, primarily using a 34 cm well-to-read
distance. The overall sequencing success rate is approximately 85% for
M13-21 and M13RP1 sequences and 65% for dye-terminator reactions. The
average usable read length is 485 bp for M13-21 sequences, 445 bp for M13RP1
sequences, and 375 bp for dye-terminator reactions.

Richards et al., Chapter 28 in AUTOMATED DNA SEQUENCING AND
ANALYSIS, M. D. Adams, C. Fields, J. C. Venter, Eds., Academic Press,
London, (1994) described the value of using sequence from both ends of
sequencing templates to facilitate ordering of contigs in shotgun assembly projects
of lambda and cosmid clones. We balance the desirability of both-end sequencing
(including the reduced cost of lower total number of templates) against shorter
read-lengths for sequencing reactions performed with the M13RP1 (reverse) primer
compared to the M13-21 (forward) primer. Approximately one-half of the
templates are sequenced from both ends. Random reverse sequencing reactions are
done based on successful forward sequencing reactions. Some M13RP1
sequences are obtained in a semi-directed fashion: M13-21 sequences pointing
outward at the ends of contigs are chosen for M13RP1 sequencing in an effort to
specifically order contigs.

4. Protocol for Automated Cycle Sequencing 
The sequencing is carried out using ABI Catalyst robots and AB 373 
Automated DNA Sequencers. The Catalyst robot is a publicly available 
sophisticated pipetting and temperature control robot which has been developed 
specifically for DNA sequencing reactions. The Catalyst combines pre-aliquoted 
templates and reaction mixes consisting of deoxy- and dideoxynucleotides, the
thermostable Taq DNA polymerase, fluorescently-labelled sequencing primers, and
reaction buffer. Reaction mixes and templates are combined in the wells of an
aluminum 96-well thermocycling plate. Thirty consecutive cycles of linear
amplification (i.e., one primer synthesis) steps are performed including
denaturation, annealing of primer and template, and extension, i.e., DNA
synthesis. A heated lid with rubber gaskets on the thermocycling plate prevents 
evaporation without the need for an oil overlay. 

Two sequencing protocols are used: one for dye-labelled primers and a
second for dye-labelled dideoxy chain terminators. The shotgun sequencing
involves use of four dye-labelled sequencing primers, one for each of the four
terminator nucleotides. Each dye-primer is labelled with a different fluorescent dye,
permitting the four individual reactions to be combined into one lane of the 373
DNA Sequencer for electrophoresis, detection, and base-calling. ABI currently
supplies pre-mixed reaction mixes in bulk packages containing all the necessary
non-template reagents for sequencing. Sequencing can be done with both plasmid
and PCR-generated templates with both dye-primers and dye-terminators with
approximately equal fidelity, although plasmid templates generally give longer
usable sequences.

Thirty-two reactions are loaded per AB 373 Sequencer each day, for a total
of 960 samples. Electrophoresis is run overnight following the manufacturer's
protocols, and the data is collected for twelve hours. Following electrophoresis
and fluorescence detection, the ABI 373 performs automatic lane tracking and
base-calling. The lane-tracking is confirmed visually. Each sequence electropherogram
(or fluorescence lane trace) is inspected visually and assessed for quality. Trailing
sequences of low quality are removed and the sequence itself is loaded via software
to a Sybase database (archived daily to 8mm tape). Leading vector polylinker
sequence is removed automatically by a software program. Average edited lengths
of sequences from the standard ABI 373 are around 400 bp and depend mostly on
the quality of the template used for the sequencing reaction. ABI 373 Sequencers
converted to Stretch Liners provide a longer electrophoresis path prior to
fluorescence detection and increase the average number of usable bases to 500-600
bp.

INFORMATICS

1. Data Management 

A number of information management systems for a large-scale sequencing
lab have been developed. (For review see, for instance, Kerlavage et al.,
Proceedings of the Twenty-Sixth Annual Hawaii International Conference on
System Sciences, IEEE Computer Society Press, Washington D.C., 585 (1993).)
The system used to collect and assemble the sequence data was developed using the
Sybase relational database management system and was designed to automate data
flow wherever possible and to reduce user error. The database stores and
correlates all information collected during the entire operation from template
preparation to final analysis of the genome. Because the raw output of the ABI 373
Sequencers was based on a Macintosh platform and the data management system
chosen was based on a Unix platform, it was necessary to design and implement a
variety of multi-user, client-server applications which allow the raw data as well as
analysis results to flow seamlessly into the database with a minimum of user effort.

2. Assembly 

An assembly engine (TIGR Assembler) developed for the rapid and
accurate assembly of thousands of sequence fragments was employed to generate
contigs. The TIGR Assembler simultaneously clusters and assembles fragments of
the genome. In order to obtain the speed necessary to assemble more than 10^
fragments, the algorithm builds a hash table of 12 bp oligonucleotide subsequences
to generate a list of potential sequence fragment overlaps. The number of potential
overlaps for each fragment determines which fragments are likely to fall into
repetitive elements. Beginning with a single seed sequence fragment, TIGR
Assembler extends the current contig by attempting to add the best matching
fragment based on oligonucleotide content. The contig and candidate fragment are
aligned using a modified version of the Smith-Waterman algorithm which provides
for optimal gapped alignments (Waterman, M. S., Methods in Enzymology
164:165 (1988)). The contig is extended by the fragment only if strict criteria for
the quality of the match are met. The match criteria include the minimum length of
overlap, the maximum length of an unmatched end, and the minimum percentage
match. These criteria are automatically lowered by the algorithm in regions of
minimal coverage and raised in regions with a possible repetitive element.
Fragments representing the boundaries of
repetitive elements and potentially chimeric fragments are often rejected based on
partial mismatches at the ends of alignments and excluded from the current contig.
TIGR Assembler is designed to take advantage of clone size information coupled
with sequencing from both ends of each template. It enforces the constraint that
sequence fragments from two ends of the same template point toward one another
in the contig and are located within a certain range of base pairs (definable for each
clone based on the known clone size range for a given library).

The process resulted in 391 contigs as represented by SEQ ID NOs:1-391.

3. Identifying Genes 

The predicted coding regions of the Streptococcus pneumoniae genome
were initially defined with the program GeneMark, which finds ORFs using a
probabilistic classification technique. The predicted coding region sequences were
used in searches against a database of all nucleotide sequences from GenBank
(October, 1997), using the BLASTN search method to identify overlaps of 50 or
more nucleotides with at least a 95% identity. Those ORFs with nucleotide
sequence matches are shown in Table 1. The ORFs without such matches were
translated to protein sequences and compared to a non-redundant database of
known proteins generated by combining the Swiss-prot, PIR and GenPept
databases. ORFs that matched a database protein with BLASTP probability less
than or equal to 0.01 are shown in Table 2. The table also lists assigned functions
based on the closest match in the databases. ORFs that did not match protein or
nucleotide sequences in the databases at these levels are shown in Table 3.
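
The assignment of ORFs to Tables 1-3 amounts to a two-stage threshold test, which the following Python sketch makes explicit. The record layout and names (OrfHits, assign_table) are hypothetical and introduced only for illustration; the sketch assumes the best BLASTN and BLASTP results for each ORF have already been summarized, and it does not run either search.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OrfHits:
    """Summary of the best database hits for one predicted ORF (hypothetical layout)."""
    orf_id: str
    nt_overlap: int = 0                 # length of best BLASTN overlap, in nucleotides
    nt_identity: float = 0.0            # percent identity over that overlap
    protein_p: Optional[float] = None   # best BLASTP probability, None if no hit


def assign_table(hit: OrfHits) -> int:
    """Return 1, 2 or 3 according to the thresholds stated in the text."""
    if hit.nt_overlap >= 50 and hit.nt_identity >= 95.0:
        return 1   # nucleotide match in GenBank
    if hit.protein_p is not None and hit.protein_p <= 0.01:
        return 2   # protein match in the non-redundant database
    return 3       # no match at these levels


examples = [
    OrfHits("orf_a", nt_overlap=120, nt_identity=98.5),
    OrfHits("orf_b", nt_overlap=40, nt_identity=99.0, protein_p=1e-20),
    OrfHits("orf_c", protein_p=0.3),
]
tables = {h.orf_id: assign_table(h) for h in examples}
print(tables)
```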

ILLUSTRATIVE APPLICATIONS

1. Production of an Antibody to a Streptococcus pneumoniae 
Protein 

Substantially pure protein or polypeptide is isolated from the transfected or
transformed cells using any one of the methods known in the art. The protein can
also be produced in a recombinant prokaryotic expression system, such as E. coli,
or can be chemically synthesized. Concentration of protein in the final preparation
is adjusted, for example, by concentration on an Amicon filter device, to the level
of a few micrograms/ml. Monoclonal or polyclonal antibody to the protein can
then be prepared as follows.

2. Monoclonal Antibody Production by Hybridoma Fusion 

Monoclonal antibody to epitopes of any of the peptides identified and
isolated as described can be prepared from murine hybridomas according to the
classical method of Kohler, G. and Milstein, C., Nature 256:495 (1975) or
modifications of the methods thereof. Briefly, a mouse is repetitively inoculated
with a few micrograms of the selected protein over a period of a few weeks. The
mouse is then sacrificed, and the antibody producing cells of the spleen isolated.
The spleen cells are fused by means of polyethylene glycol with mouse myeloma
cells, and the excess unfused cells destroyed by growth of the system on selective
media comprising aminopterin (HAT media). The successfully fused cells are
diluted and aliquots of the dilution placed in wells of a microtiter plate where
growth of the culture is continued. Antibody-producing clones are identified by
detection of antibody in the supernatant fluid of the wells by immunoassay
procedures, such as ELISA, as originally described by Engvall, E., Meth.
Enzymol. 70:419 (1980), and modified methods thereof. Selected positive clones
can be expanded and their monoclonal antibody product harvested for use. Detailed
procedures for monoclonal antibody production are described in Davis, L. et al.,
Basic Methods in Molecular Biology, Elsevier, New York, Section 21-2 (1989).

3. Polyclonal Antibody Production by Immunization

Polyclonal antiserum containing antibodies to heterogenous epitopes of a
single protein can be prepared by immunizing suitable animals with the expressed
protein described above, which can be unmodified or modified to enhance
immunogenicity. Effective polyclonal antibody production is affected by many
factors related both to the antigen and the host species. For example, small
molecules tend to be less immunogenic than others and may require the use of
carriers and adjuvant. Also, host animals vary in response to site of inoculations
and dose, with both inadequate and excessive doses of antigen resulting in low titer
antisera. Small doses (ng level) of antigen administered at multiple intradermal
sites appear to be most reliable. An effective immunization protocol for rabbits
can be found in Vaitukaitis, J. et al., J. Clin. Endocrinol. Metab. 33:988-991
(1971).

Booster injections can be given at regular intervals, and antiserum harvested
when antibody titer thereof, as determined semi-quantitatively, for example, by
double immunodiffusion in agar against known concentrations of the antigen,
begins to fall. See, for example, Ouchterlony, O. et al., Chap. 19 in: Handbook of
Experimental Immunology, Wier, D., ed., Blackwell (1973). Plateau concentration
of antibody is usually in the range of 0.1 to 0.2 mg/ml of serum (about 12 µM).
Affinity of the antisera for the antigen is determined by preparing competitive
binding curves, as described, for example, by Fisher, D., Chap. 42 in: Manual of
Clinical Immunology, second edition, Rose and Friedman, eds., Amer. Soc. For
Microbiology, Washington, D.C. (1980).

Antibody preparations prepared according to either protocol are useful in
quantitative immunoassays which determine concentrations of antigen-bearing
substances in biological samples; they are also used semi-quantitatively or
qualitatively to identify the presence of antigen in a biological sample. In addition,
antibodies are useful in various animal models of pneumococcal disease as a means
of evaluating the protein used to make the antibody as a potential vaccine target or
as a means of evaluating the antibody as a potential immunotherapeutic or
immunoprophylactic reagent.

4. Preparation of PCR Primers and Amplification of DNA

Various fragments of the Streptococcus pneumoniae genome, such as those
of Tables 1-3 and SEQ ID NOS: 1-391, can be used, in accordance with the present
invention, to prepare PCR primers for a variety of uses. The PCR primers are
preferably at least 15 bases, and more preferably at least 18 bases, in length. When
selecting a primer sequence, it is preferred that the primer pairs have approximately
the same G/C ratio, so that melting temperatures are approximately the same. The
PCR primers and amplified DNA of this Example find use in the Examples that
follow.
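
The primer selection criteria above (minimum length and matched G/C ratio) can be illustrated with a minimal sketch. The function names, the 0.10 G/C tolerance and the use of the Wallace rule of thumb for melting temperature are assumptions made for illustration only; they are not part of this Example.

```python
def gc_fraction(primer: str) -> float:
    """Fraction of G and C bases in the primer (0.0 to 1.0)."""
    seq = primer.upper()
    if not seq:
        raise ValueError("primer must not be empty")
    return (seq.count("G") + seq.count("C")) / len(seq)


def wallace_tm(primer: str) -> int:
    """Rule-of-thumb melting temperature in deg C: 2*(A+T) + 4*(G+C).

    Assumption: the Wallace rule is only a rough estimate for short
    oligonucleotides; it is not a method described in this Example.
    """
    seq = primer.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


def check_primer_pair(forward: str, reverse: str,
                      min_len: int = 15, max_gc_diff: float = 0.10) -> dict:
    """Compare two primers on length, G/C ratio and estimated Tm.

    min_len follows the 15-base minimum stated in the text; max_gc_diff
    is a hypothetical tolerance chosen only for illustration.
    """
    gc_diff = abs(gc_fraction(forward) - gc_fraction(reverse))
    return {
        "length_ok": len(forward) >= min_len and len(reverse) >= min_len,
        "gc_diff": gc_diff,
        "gc_ok": gc_diff <= max_gc_diff,
        "tm_diff": abs(wallace_tm(forward) - wallace_tm(reverse)),
    }
```

A pair that passes the length check and shows a small G/C difference would be expected to have similar melting temperatures, which is the property the text asks for; raising min_len to 18 applies the more preferred length.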

5. Gene expression from DNA Sequences Corresponding to ORFs

A fragment of the Streptococcus pneumoniae genome provided in Tables 1-3
is introduced into an expression vector using conventional technology.
Techniques to transfer cloned sequences into expression vectors that direct protein
translation in mammalian, yeast, insect or bacterial expression systems are well
known in the art. Commercially available vectors and expression systems are
available from a variety of suppliers including Stratagene (La Jolla, California),
Promega (Madison, Wisconsin), and Invitrogen (San Diego, California). If
desired, to enhance expression and facilitate proper protein folding, the codon
context and codon pairing of the sequence may be optimized for the particular
expression organism, as explained by Hatfield et al., U.S. Patent No. 5,082,767,
incorporated herein by this reference.

The following is provided as one exemplary method to generate
polypeptide(s) from cloned ORFs of the Streptococcus pneumoniae genome
fragment. Bacterial ORFs generally lack a poly A addition signal. The addition
signal sequence can be added to the construct by, for example, splicing out the poly
A addition sequence from pSG5 (Stratagene) using BglI and SalI restriction
endonuclease enzymes and incorporating it into the mammalian expression vector
pXT1 (Stratagene) for use in eukaryotic expression systems. pXT1 contains the
LTRs and a portion of the gag gene of Moloney Murine Leukemia Virus. The
positions of the LTRs in the construct allow efficient stable transfection. The
vector includes the Herpes Simplex thymidine kinase promoter and the selectable
neomycin gene. The Streptococcus pneumoniae DNA is obtained by PCR from the
bacterial vector using oligonucleotide primers complementary to the Streptococcus
pneumoniae DNA and containing restriction endonuclease sequences for PstI
incorporated into the 5' primer and BglII at the 5' end of the corresponding
Streptococcus pneumoniae DNA 3' primer, taking care to ensure that the
Streptococcus pneumoniae DNA is positioned such that it is followed by the poly
A addition sequence. The purified fragment obtained from the resulting PCR
reaction is digested with PstI, blunt ended with an exonuclease, digested with
BglII, purified and ligated to pXT1, now containing a poly A addition sequence
and digested with BglII.
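
The primer construction described above, with a PstI site on the 5' primer and a BglII site at the 5' end of the 3' primer, can be sketched as follows. The recognition sequences CTGCAG (PstI) and AGATCT (BglII) are standard; the function names, the 18-base annealing length and the two-base "GC" clamp in front of each site are hypothetical choices for illustration and are not specified in this Example.

```python
# Recognition sequences of the two enzymes named in the text.
SITES = {"PstI": "CTGCAG", "BglII": "AGATCT"}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string containing only A, C, G, T."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def internal_sites(insert: str) -> list:
    """Names of the enzymes whose site also occurs inside the insert.

    An internal site would be cut during cloning, so a non-empty result
    suggests choosing different enzymes.
    """
    seq = insert.upper()
    return [name for name, site in SITES.items() if site in seq]


def design_cloning_primers(insert: str, anneal_len: int = 18,
                           clamp: str = "GC") -> tuple:
    """Build a 5' primer carrying PstI and a 3' primer carrying BglII.

    clamp is a hypothetical short 5' extension placed before each site;
    anneal_len is the number of insert bases each primer anneals to.
    """
    seq = insert.upper()
    if len(seq) < anneal_len:
        raise ValueError("insert is shorter than the annealing length")
    forward = clamp + SITES["PstI"] + seq[:anneal_len]
    reverse = clamp + SITES["BglII"] + reverse_complement(seq[-anneal_len:])
    return forward, reverse
```

Both recognition sequences are palindromic, so the same site string can be placed on either primer without reverse complementing it.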

The ligated product is transfected into mouse NIH 3T3 cells using
Lipofectin (Life Technologies, Inc., Grand Island, New York) under conditions
outlined in the product specification. Positive transfectants are selected after
growing the transfected cells in 600 µg/ml G418 (Sigma, St. Louis, Missouri).
The protein is preferably released into the supernatant. However, if the protein has
membrane binding domains, the protein may additionally be retained within the cell
or expression may be restricted to the cell surface. Since it may be necessary to
purify and locate the transfected product, synthetic 15-mer peptides synthesized
from the predicted Streptococcus pneumoniae DNA sequence are injected into mice
to generate antibody to the polypeptide encoded by the Streptococcus pneumoniae
DNA.
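
Deriving candidate 15-mer peptides from a predicted DNA sequence, as mentioned above, can be sketched as a translation followed by a sliding window. This is a minimal illustration that assumes the standard genetic code and reading frame 0; the function names and the window step are hypothetical, and the text does not state how the peptides are chosen.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: aa
    for (a, b, c), aa in zip(product(_BASES, repeat=3), _AMINO)
}


def translate(dna: str) -> str:
    """Translate DNA in frame 0 up to, but not including, the first stop codon.

    Codons containing characters other than A, C, G, T become 'X'.
    """
    seq = dna.upper()
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE.get(seq[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def peptide_windows(protein: str, size: int = 15, step: int = 5) -> list:
    """All peptides of length `size`, taken every `step` residues.

    step is a hypothetical spacing; step=1 yields every possible 15-mer.
    """
    return [protein[i:i + size]
            for i in range(0, len(protein) - size + 1, step)]
```

In practice the candidate list would still need to be filtered, for example for ambiguous residues marked 'X', before any peptide is synthesized.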

Alternatively, and if antibody production is not possible, the Streptococcus
pneumoniae DNA sequence is additionally incorporated into eukaryotic expression
vectors and expressed as, for example, a globin fusion. Antibody to the globin
moiety then is used to purify the chimeric protein. Corresponding protease
cleavage sites are engineered between the globin moiety and the polypeptide
encoded by the Streptococcus pneumoniae DNA so that the latter may be freed
from the former by simple protease digestion. One useful expression vector for
generating globin chimerics is pSG5 (Stratagene). This vector encodes a rabbit
globin. Intron II of the rabbit globin gene facilitates splicing of the expressed
transcript, and the polyadenylation signal incorporated into the construct increases
the level of expression. These techniques are well known to those skilled in the art
of molecular biology. Standard methods are published in methods texts such as
Davis et al., cited elsewhere herein, and many of the methods are available from the
technical assistance representatives from Stratagene, Life Technologies, Inc., or
Promega. Polypeptides of the invention also may be produced using in vitro
translation systems such as the in vitro Express™ Translation Kit (Stratagene).

While the present invention has been described in some detail for purposes
of clarity and understanding, one skilled in the art will appreciate that various
changes in form and detail can be made without departing from the true scope of
the invention.

All patents, patent applications and publications referred to above are
hereby incorporated by reference.

(1) GENERAL INFORMATION:

(i) APPLICANT:
Charles Kunsch
Gil H. Choi
Patrick S. Dillon
Craig A. Rosen
Steven C. Barash
Michael R. Fannon
Brian A. Dougherty

(ii) TITLE OF INVENTION: Streptococcus pneumoniae Polynucleotides and Sequences

(iii) NUMBER OF SEQUENCES: 391

(iv) CORRESPONDENCE ADDRESS:

(A) ADDRESSEE: Human Genome Sciences, Inc.

(B) STREET: 9410 Key West Avenue

(C) CITY: Rockville

(D) STATE: Maryland

(E) COUNTRY: USA

(F) ZIP: 20850

(v) COMPUTER READABLE FORM:

(A) MEDIUM TYPE: Diskette, 3.50 inch, 1.4Mb storage

(B) COMPUTER: HP Vectra 486/33

(C) OPERATING SYSTEM: MSDOS version 6.2

(D) SOFTWARE: ASCII Text

(vi) CURRENT APPLICATION DATA:
ACGATTTTCC TTCAAATTTG GAGGTTCAAG GTCCTGTACA
CTTTTAATGA GATGTCCCAT GATTTGCAGG TAAGCTTTGA 
CAGAAAAGGG CTTGATGATT GCCCAGTTGT CCCATGATAT 
TCCAAGCGAC GGTAGAAGGG ATTTTGGATG CGATTATCAA 
ATCTAGCAAC CATTGGACGC CAGACGGAGA GGCTCAATAA 
TTTTGACCCT AAACACAGCT AGAAATCACG TCGAAACTAC 
TGGACAAGCT CTTAATTGAG TGCATGAGTG AATTTCACTT 
GAGATGTCCA CTTGCAGGTA ATCCCAGAGT CTGCCCGCAT 
TTTCTCCTAT CTTGGTGAAT CTGGTCGATA ACGCTTTTAA 
AGCTGGAAGT CCTCCCTAAG CTGGAGAAGG ACCACCTTTC 
GGCAGGGTAT TGCCCCAGAG GATTTGGAAA ATATTTTCAA 
CTTCGCGTAA CATGAAGACA GGTGGTCATG GATTAGGACT 
CCCATCAATT GGGTGGGCAA ATCACAGTCA GCAGCCAGTA 
CCCTCGTTCT CAACCTCTCT GCTAGTGAAA ATAAAGCCTA 
CTATTCATGG TAGAATAGAT TTTCTCTGAA ATATCAGCAG 
CAGCTGTCTT ATGACAACTA ACCTTGGCTG TTTAGCCGAA 
(2) INFORMATION FOR SEQ ID NO: 30:

(i) SEQUENCE CHARACTERISTICS:

(A) LENGTH: 9769 base pairs

(B) TYPE: nucleic acid

(C) STRANDEDNESS: double

(D) TOPOLOGY: linear



AT«»€;%3CAA 
TTCCTTGGAA 
TAAGACTCCT 
GGAGTCGGAG 
ACTGCTTGAG 
CAGTAAACAC 
TTTGATTGAG 
TGAGGGAGAT 
ATATTCTCCT 
AATCAGTGTG 
ACCCCTTTAT 
TCCGATTGCC 
CCGTCTAGGA 
AAACCCCTTT 
GAAAGCATGA 
GCGCATCTGC 



TTAGGGCAAA 

GAAAGCGAAC 

ATCACTTCGA 

CAAGCTCATT 

GACTTGAATT 

ACTATTTTTC 

CAGGAGAGAA 

TATCCTAAGC 

CCAGGAACCA 

ACCGATGAAG 

CGTGTCGAAA 

CCTGAATTGG 

AGTACCTTTA

ACAAATCCAG 

ACCTCGTCAA 

ACCC 



9360
9420
9480
9540
9600
9660
9720
9780
9840
9900
9960
10020
10080
10140
10200
10254



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 30:
CCCGCGACTA TCGATAACAC TTGACTTGGT AGCCCCACAT TTTGGACAAC 
CCTCCTTATC Gr rilCTTTT CATTATACCA TTTTTTAAGC GATTCCCAAA 
TTTTTGCTTG ACAAGTTTTT TGTTTTGTTG TATTATTTAA TTAAGACAAC 
AAAGGAGACT AAGATGTCCT GGACATTTGA CAACAAAAAA CCCATCTATT 
GGAGAAAATC AACCT-TCAGA TTGTTTCCCA TACACTGGAA CCCAATCAAC 
CGTGAGGACC TAGCTAGCGA GGCTGGTCTC AATCCCAATA CCATCCAAAC 
GACCTTGAAC GAGAAGGATT TGTCTACAGC AAGCGAACAA CTGCACGATT 
GATAAGCAGC TAATCGCCCA GTCACGCAAA CAATTATCAG AACAAGAATT 



GCATCCTTTC 
ACAATTCTTC 
AAGGTAAAAG 
TACAGATTAT 
AACTTCCAAC 
ACCCTTATCA 
TGTGACTAAG 
GCAACACTTC 



60 
120 
180
240 
300 
360 
420 
480 
GTTTCCTCCA


TGACCCATTT 


TGGCTATGAA 


AAAGAAGAAC 


TACCAGGCGT 


AGTCACTGAT 


540 


TATATTAAAG 


CAGTTTAAGC 


CTATGTCATT 


ACTAGTATTT 


GAAAATGTAT 


CCAAATCATA 


600 


TGGAGCAACA 


CCAGCCCTTG 


AAAATGTTTC 


TCTTGACATT 


CCAGCTGGAA 


AAATTGTCGG 


660 


CCTTCTTGGG 


CCAAACGGCT 


CAGGAAAAAC 


AACCCTGATT 


AAACTAATTA 


ATGGCCTCTT 


720 


ACAACCAGAT 


CAAGGACGTG 


TCCTCATCAA 


CGACATGGAC 


CCAAGCCCAG 


CAACCAAGGC 


780 


CGTTGTAGCT 


TATTTGCCTG 


ATACGACCTA 


TCTCAATGAG 


CAAATGAAGG 


TCAAAGAAGC 


840 


CCTAACCTAC 


TTCAAGACCT 


TCTATAAAGA 


TTGTCAGATC 


TTCAACGCCC 


CCATCATCTA 


900 


CTTGCAGACC 


TGGGCATTGA 


TGAAAATAGT 


CCTCTCAAGA 


AACTATCAAA 


AGGAAACAAA 


960 


GAAAAGGTTC 


AACTGATTTT 


GGTTATGAGC 


CGTGATGCTC 


GTCTCTATGT 


TTTGGACGAA 


1020 


CCCATTGGTG 


GGGTGGATCC 


AGCAGCCCCT 


GCTTATATCC 


TCAATACCAT 


TATCAACAAC 


1080 


TACTCACCAA 


CTTCTACCGT 


TTTGATTTCT 


ACCCACTTGA 


TTTCTGATAT 


CGACCCAATC 


1140 


TTGGATGAAA 


TTGTCTTCCT 


AAAAGACGGA 


AAAGTCGTCC 


GTCAAGGAAA 


TGTAGATGAT 


1200 


ATTCGCTACG 


AGTCAGGTGA 


ATCCATTGAC 


CAACTCTTCC 


GTCAGaATTT 


AAGGCCTAAG 


1260 


CAAAGGAGAT 


TATTTATGTT 


TTCGAATTTA 


GTTCGCTACC 


AATTTAAAAA 


TGTTAACAAG 


1320 


TGCTATTTAG 


CCCTCTACGC 


AGCCGTGCTA 


GTCCTTTCTG 


CCCTCATCCG 


AATACACACA 


1380 


CAAGGCTTTA 


AAAATCTACC 


TTACCAAGAA 


AGTCAGGCTA 


CTATGCTACT 


TTTTCTACCT 


1440 


ACACTCTTTG 


GTCGCTTGAT 


GCTTACACTT 


GGCATTTCAA 


CCATTTTCTT 


GATTATTAAA 


1500 


CGCTTCAAAG 


GTAGTGTCTA 


CGACCGACAA 


CCCTATCTGA 


CTTTGACCTT 


CCCACTTTCT 


1560 


CAACACCATA 


TCATCACAGC 


CAAACTAATC 


GCTGCCTTTA 


TCTGGTCATT 


CATTAGCACC 


1620 


CCTGTATTCG 


CTCTAACTGC 


TCTTATTATT 


crr;ccTTTAA 


CACCTCCACA 


ATGCATTCCT 


1680 


CTTTCTTATG 


TGATTACATT 


TGTAGAAACA 


CATCTCCCTC 


AGATCTTTCT 


TACAGGTATA 


1740 




TAAATACTAT 


TTCAGCAATC 


CTCTCCATCT 


ACCTCGCTAT 


TTCCATTGGA 


1800 


CAGCTTTTCA 


ATGAATACCG 


TACAGCACTC 


GCTGTTGCAG 


TCTACATTGC 


TATCCAAATC 


1860 


GTCATTGGAT 


TTATTGAACT 


TTTCTTCAAT 


CTTAGTTCTA 


ATTTCTATGT 


CAATTCACTG 


1920 


GTAGGACTCA 


ATGACCATTT 


CTATATGGGA 


GCAGGTATAG 


CCATTGTTGA 


AGAACTCATA 


1980 


TTCATAGCTA 


TCTTTTATCT 


CGGAACCTAC 


TACATCTTGA 


GAAATAAGCT 


TAATTTGCTT 


2040 


TAAATAATTT 


TTACCTAGAT 


ATGTAACATA 


CTCATAGAAC 


AAAAGACACC 


AGGCAAAAAG 


2100 


TCTTTAAAAT 


TACAAAACGC 


ATAGTATCAG 


GTGTTGAATA 


TGTACTGCcC 


CCCAAAAGTT 


2160 


AGATTTTTTC 


TGTCTAACTT 


TTGGGGGCAG 


TTCATAAGAA 


CCTTGGTAAT 


ATCCGTTTTT 


2220 
TGTGAGCTGA
ATTTTGGAAT 
TTCAGTTCAC 
GGTTCGTGAT 
ATAAGATAAG 
ACTAGCATAC 
AAGTGAAATC 
GAAGGTCGAA 
ACCTAGACCT 
TTTCCCTGAG 
CATGAAAAGA 
ATCCAAAATA 
ACCACCTACA 
CGCATAGAAA 
TTTTTCACTG 
AAAGAATCCT 
ATCAATCTTT 
GGAACCAGTT 
TTTAATTCCA 
TTCATCAATG 

CACTCTTGTC 
AATTCCTACA 
TGACTCGACA 
ACTCCTGATT 
ATATATTGAC 
ATCAATTTCC 
AATTCTTCCT 
ACATCCTGAA 
CAAAGTTCCT 



CTTATTTCCT 
TCAAATCAAT 
TATACAATTG 
TCCACCCTTT 
GCACGTTTAA 
ATGCGTCCCA 
CATGCTTCTC 
ATAAAGGGAA 
GTCACTCCAA 
TCCTCAGGAC 
CTGATCAAAA 
CCAGACTGAG 
ACATAGATCC 
TAGTGACTTC 
GCAACTTCCT 
AAGGCACCTC 
TCTGTGAATT 
GAACGATTAA 
TTTTCAACCC 
GTCAAATAAC 
TACTCGAAGT 
ACTACTGCCA 
GAGATTCCTA 
TGTCGAAGAT 
CTAGTTTAAA 
CTTGAGTCAA 
AACTGCCTTG 
TGCTTCGTTC 
CTGCTCCCTG 
CAACATTGGT 



TTCACTATAT 
TTATAAGAAT 
AGTTTTCAAC 
TCACCTTTAA 
AGCTTTTCXTA 
TAAATCCTGT 
CTCCCCCCCC 
TATAAGAACC 
AAAAACCACC 
GAGAAACCAT 
TAAAGAGCAA 
CCAACAATGG 
CAATATGCGT 
CCCTTATGCT 
GAGCTGTTAC 
CTGCAATTCT 
GAATTGTCTG 
CCTGATTTTG 
ATCTTTCGCC 
CTTTTAATTT 
TAACACCATT 
CTTTATTATT 
AAAAGAGGAA 
AGGTTTCCTT 
GATTTCATCG 
GATTCAGAAG 
TTTCGTCAAG 
ACTTGAAACA 
CAAGACCACA 
CATGACATGG 



CGCAAAATCA 

GTTTTAGAAG 

CAACCTCTTT 

AAACCTCGCT 

AATCCCTAAA 

TCCTACCACC 

ATAGTCATTA 

AATCTTCAAG 

CATAATCAAA 

AGATCCTAGC 

CGTATTCAGT 

CAAATCTTTA 

TAAAATCACT 

AGAAAAAACC 

ACCCGCATAC 

TTGAATAAAC 

CCCTAACCCT 

CAGTTCATTG 

ATCATAAACT 

TTCTTCTTTA 

TACATTCTTC 

TTTAGCCATA 

CGGCGAAATC 

GATTACAACC 

ATAGTTGGCG 

AGTTCCCTTC 

CTCACCTGTT 

AAGAGACGCG 

CGGCCATCTC 

TCAGAAAACA 



AATAAGAACC 
TAATATTATC 
ACATAATGTG 
TTCGCAAGGC 
TCATCCCTTT 
GCAAAAATCA 
ATCGTTCCAA 
AGGAGATTCT 
ATCATCAAAC 
AACGCTGCCA 
GAGATACCAT 
AAGAGCAAAA 
ACAAACAGAG 
ACTTCCATAA 
CTAATCAGAA 
TTTTTATTTT 
TTTTCCTGCT 
AGTGTACCTG 
GCCTTTAGAA 
ATTCCTTCTT 
AGTCCTTCTG 
GAAGAACCTT 
ACCATAAACA 
CACATATTTC 
CTTGTTGCTC 
CAGCGCTCTC 
TGACATGAGC 
TTTTCCCGTA 
GGATCATCAG 
TAATCCTTCT 



CAACGATGGG 
CTATTCCAGA 
TACATAATTA 
TCTTCTATTT 
GAAGAACGAG 
CTGTAATAGC 
ACGGCATAAA 
CACCAGCTGC 
CCCACAAGGC 
AGACTACGTA 
CTCCCAAGTG 
CGGCACCCAG 
CCATCATCCG 
TTTTGCTGCC 
TCATATAAAC 
CCTTCCCTTC 
CTTCAGACAA 
TAACCTCAAA 
CACTATCTTC 
TGGCACTTCC 
CTACAGATGC 
GGAGATGCCC 
AGAAACTCCA 
TCATACTTCC 
AAATGTTGCG 
ATCCTCCAAA 
AAGATTTTCC 
TTGATTGCGG 
AATATCGTCA 
CCGCGCTCTT 



2280 
2340 
2400 
2460 
2520 
2580 
2640 
2700
2760 
2820 
2880
2940 
3000 
3060 
3120 
3180 
3240 
3300 
3360 
3420 
3480 
3540 
3600 
3660 
3720 
3780 
3840 
3900 
3960 
4020 
TTTCCTGAAA AATGACTTGT TTGAGCAATT CTGTATTAAC TGGGTCCAAT CCACTAAAAG 4080
GCTCATCCAA GATAATCAGG TCTGGTTCAT GAATCAGAGT AATAATGAGC TCAATCTTCT 4140
GCTGATTTCC TTTTGACAGA CTCTTGATTT TATCTGTCAG CTTTCCTTTC ACTTCCAACC 4200
TCTTCATCCA TTGAGGGAGT TTTTCTTTGA CTTCTTTGGC ATCCATGCCT TTTAGAGTCG 4260
CCAAGTAGCC AACTTGTTCA AGAACTGTCA ATTTAGGCAT GAGATGCGTT CTTCAGGCAC 4320
ATAACCAATC CGAGCATAGG TCTCCTGACG AATATCCTGA CCATCCAGAC CGATTTCTCC 4380
CTGATATTCT AGGAATTTCA AAATACTATG GAAAATCGTT GTTTTTCCAG CACCATTTTT 4440
TCCGACTAGT CCCAAAATAC GACCTGGTCG CGCTTGAAAG TCAATACCAA ACAAAACTTG 4500
CTTGGATCCA AAACTTTTCT CTAGACTTCT TACTTCTAGC ATCTTTCACC TCCGAAATTT 4560
CTTGCACTCA TTATACTCCT TTTTGATAGC CTTTACAATG TTTTTTGTCC ATTTTTAGAA 4620
GACTATTGCT GTGTAAAATA TGGCCTCGAG CACTTTTATA CTCAATGAAA ATCAAAGAGC 4680
AAACTAGGAA GCTAGCCGTA GACTGCTCAA AGTACAGCTT TGACCTTGCA GATAAAACTG 4740
ACCAAGTCgA CTCAAAACAC TGTTTTGAGG TTCTGGATAC AACTGACGAA kCrTAaCTAT 4800
ATCTACGGCA AGGCGAAcTG ACGTGGTTTG AAGAGATTTT CGAAGACTAT TACTGATAAA 4860
TCCATTATAC ACCAGCAAAC TTAATTTATA CCTTCCGCTC CTCAACTGTC TATTTTTAAT 4920
CCTGAATTGT TATTTGAGTA ACTCCTTTTT CCTCCTAAAG TTTTCTTCCT CTAAAACTTC 4980
TGGAAAAAGC CTAATACTTT CAGACAACAT TTTTATAAGA AACAAGTTCA TCTGTCATTT 5040
CAAGAAGGAG TAATCCTTTA TCTACTAATC CACCCAACAG AATTCAACCC CTTGTCCGAT 5100
ATGTTTTCTA AGGATTATAT AGTAAAATGA AATAAGAACA GGACAAATTC ATCAGGACAG 5160
TCAAATTGAT TTCTAACAAT GTTTTAGAAG TAGATCTATA CTATTCTAGT TTCAATCTGC 5220
TATATCTATT ATGCACACCC CTATAGGATC TAATGAAAAT CACAACAGGC TCATTCATAG 5280
ATGCTTACCT AAGCCTAAGG CAACTAAGAA AACGACTACC AAGGAAGTCG CATTCATCGA 5340
AAAGTAGATT AACAACTATC CTAAAAAATG CTTGAACTAC AAGTCCCCCA GAGAAGACTT 5400
CTGGATGACT AACTTGAACT TGAAATTTAG CAATAATTAA TTCACTATCT AACTATATTT 5460
AGTAATTATT TCAGAACTGA TTAATATTAA AATTAACTAA CAATTCAAAG GATTCATACT 5520
AGCCATAAAT TACGTCCATC AGAGAGAGAC TCTTACTACT TTTAGATTTT AGTCTTTCTA 5580
GCTTCAGAAT ACATCTAAAC TTTAGGGAAA ATGACTATTC GAAAGCGCGA ATGCCTCAAA 5640
ATTATCTCAG ATAAGCTATT CGAAACTTAG AATGCTTTTA AATTTATGGA A7TGCGATTA 5700
TTCGAAACCT AGAATGCATA TAACCTTTAG TTGACAGACC TATTCTAAGT CTCCAAGGGC 5760

TATTTACTTT
TATAGTACAA 
TCTTTCTCCC 
TTTGATTTTC 
TGCCTTTATC 
ACAGACTGTC 
CGGGCACATC 
CTTTTATCTA 
CTACCACAAG 
TATGATGATT 
GATGAGCTTC 
TTATACCTAC 
AAATCCCTTT 
AATATACTTT 
CCTTTCCGCC 
GCCTAGTGTC 
CTTAAGCAAA 
GATTTTCTGC 
CCATTTCCTG 
CCATTAAGAA 
ACCCCTCCAC 
TGACCGATAT 
CGTACACCTG 
TCGAAACGAC 
TAATGACCCA 
GCACCGATAC 
TTACGGGCAA 
ACGTTTAGGG 
CTCGTCCTGA 
CCAACTGCAA 



CTATTCCTTA 
ATATACTATC 
CTTCTCCAGC 
TGTATTTGGG 
CAGCAGGCAG 
CTCCCTATCA 
ATCGGGACTA 
GTGCGCCTAT 
TACATCGACT 
TGGCCCATTA 
AAGCGCTACA 
GGTCTCACCT 
GCTTTCCCAA 
CGTATGGAAA 
GTGATAGAAA 
TCAAAGTTTA 
GGCTCAAAAA 
TTTATTTTGA 
CGACTGCTCG 
GAGATGTAAA 
CAAAGACAAT 
AGTAGGCTTG 
TACGAGCTTC 
AAACACCCTT 
TTTCAGGCTG 
CTGTACCCAT 
CCATTTCACC 
CGCGACGAAG 
TAAAGCCATA 
GACCAGCAAG 



TCAAAAAACA 
TATGAGGAGT 
GAGTTATCAA 
CTTATCACGC 
GCATCTGGCG 
TTCCAGGGGC 
TCTACAACTA 
ACGGAGCTGC 
GGCTAGATAA 
CCCCAGCTGA 
TGACCATCAT 
ATATTATTGA 
GTGGATTTTT 
TCATGCATAT 
CACCTGAAAT 
GCTATCGAAT 
TATTGTTTTC 
AACTTCTTTT 
CCTCACCATA 
TTTCTCACGG 
CACGTCTGGG 
AACATCCCAA 
CAAACTTGGA 
AAACTCTTTT 
ACCCACACCA 
TGTGTACTAA 
GTAAGCAGAG 
GGCACCAAGC 
AGTTTTTGAG 
GTTATCCAAT 



CTCATTCCCC 



TTACATGTCA 
TATCTCATCG 
TCCGATTTTA 
TCCACCTCTC 
CTTGACCTCG 
TATCGCCATC 
CTTTGTCCAG 
GGCCAATCGT 
CTTTCTCTCT 
CATTCTGACC 
CTTTTTCTGG 
AAAGCGTAGA 
TTTTCGATAG 
CTAATGGTTT 
TTTGAAGAAA 
AACCACAAAA 
GCAAGAACAA 
TACTCACCCA 
ACACGGTCCA 
CGCAAAGTCA 
ACAGGGTTCT 
CCAGCTCCAT 
TCAATATCCA 
CCGATAAACT 
ACCAAGTTTT 
CTGTTTACGT 
AAGTCTACAT 
TTTTTGTCAA 
TTTGAGAAGA 



CTTTCTCCTC 
CAGGATAAAC 
ATTGTCGGTG 
CAATCCAAGG 
TTTATCTTTT 
GTGGCTGGGG 
GTGATTGGCT 
TCTGTCGTCA 
TTTGACCGCT 
ATGCTGGCTC 
AAACCCTTTA 
CAAATGCTTT 
TTAACTATAG 
TGAGGCGACG 
CAGCTATTCC 
CTCGCTACCG 
TCCCTTTGCT 
ACTTCCCAAG 
CATCTCCTAC 
GCATATGTTG 
CTGTCGCATT 
TGACTTCAAT 
AACCTTCTAG 
TTCCCTGTCT 
CACCACGTTG 
CGATACCACC 
CTCTTGTGAA 
TTCCCCAGTT 
TATCAATCGG 
ACTCAATGGT 



CAAAATATGG 
AAATGAAAGC 
GCGTTGGCAG 
AAACCCTCTC 
TACAGATTTT 
TCTTTATCTA 
GTGCCATTAT 
GCAAGCGCAC 
TCTTTATTTT 
CCCTGACCAA 
CCCTCGTGGT 
GACACGTAAA 
CTTGATACTA 
ACTTACCTAG 
CAAACTTTGA
TCCGTAATCA 
rrCCCAAGCG 
TGTGGCAGAA 
TGGTAGCTAA 
TTGACCCATG 
AACCGCAGCT 
AGTTTCCCCA 
ACATCCCTTA 
AGCAACATAA 
GATGACGCCT 
ACCAGCATTG 
CTACATTGCC 
TGGTTTTGGA 
CCCAAATGAA 
TTTATCGATT 



5820 

5880 

5940 

6000 

6060 

6120 

6180

6240 

6300 

6360 

6420 

6480 

6540 

6600 

6660 

6720 

6780 

6840 

6900 

6960 

7020 

7080 

7140 

7200 

7260 

7320 

7380 

7440 

7500 

7560 



WO 98/18931 PCT/US97/19588



GTTTCGATTG 


GAGTTGTTGT 


TGGAAATTGT 


GTTTTTTCTA 


CAACGTTAAA 


GTTTTCATCA 


7620 


CCGACAGCAC 


AGACAAACTT 


TGTACCGCCC 


GCTTCCAAGC 


TTCCATATAA 


TTTTGTCATG 


7680 


ATAAACCTCT 


TGTTTITATT 


TTCTTTATTA 


TAGCATACTT 


CGAAACTCTA 


AATGTCTCTA 


7740 


TTTTTTAGAT 


TTTCCTCTGT 


AAATCTTACT 


ATCTAATAAA 


AACGAACAAA 


CATGTCATTT 


7800 


GTTCGTTTTC 


ACATTAGAGA 


GGATTGATTA 


GATTTTCACT 


TCGATCACAG 


CATCCCCCTT 


7860 


AGCAACTGAA 


CCTGTTGCCA 


CTGGAGCTAC 


TGAAGCGTAG 


TCACCTGTAT 


TTGTAACGAT 


7920 


AACCATTGTT 


GTATCATCAA 


GTCCAGCTGC 


AGCGATTTTG 


TTTGACTCAA 


ATGTTCCAAG 


7980 


AACATCGCCA 


GCTTTCACCT 


TATTACCTTG


ACCAACTTTT 


GTTTCAAAAC 


CGTCACCGTT 


8040 


CATAGATACA 


GTATCAATAC 


CAACATGAAT 


CAAAACTTCA 


GCACCATTTC 


TTGTTTTCAA 


8100


ACCAAAAGCG 


TGCCCTGTTG 


GAAAGGCAAT 


TGAAACTTCA 


GCATCAGCTG 


CTGCATAGAC 


8160 


CACGCCTTGG 


CTTGGTTTCA 


CAACGATACC 


TTGTCCCATA 


CCTCCACTTG 


AGAAGACTGC 


8220


GTCATTGACA 


TCAGCAAGAG 


CGACAACATC 


ACCGACGATA 


GGAGTTACAA 


GTGTTTCATT 


8280


TTGAAGAGCT 


GCTGGCGCAA 


CTTCTTCTTT 


TTCTTCAGCC 


ACTTCAGCTC 


CTTTTGCACC 


8340 


TGCAGTTGCG 


TCTACTTCAT 


CTTCGTAACC 


AAACATGTAA 


GTAAGAGCAA 


AACCAAGGGC 


8400 


AAATGATACA 


GCTACCATAA 


GAACGTATTG 


TGGAAGTTCT 


CCGTTACCAA 


CATAAAGCAT 


8460 


TGTACCAGGC 


ATGATGGTGA 


TACCATTACC 


ACTACCAGCA 


ACTCCAAGGA 


TACAAGCCAA 


8520 


TCCACCACCG 


ATTGCACCAG 


CAATCAATGA 


AAGCAAGAAT 


CGTTTACGGA 


AGCGCAAGTT 


8580 


CACCCCGAAG 


ATAGCAGGCT 


CTCTAATACC 


TAGGA-\GGCA 


GAAAGAGCAC 


CCGGGAAAGC 


8640 


AAGTGTTTTC 


ACTTTTGGAT 


TTTTTGTTTT 


AACACCAACC 


GCAACACTAC 


CAGCACCTTC 


8700 


AGCrGTCATA 


GCACCTGTCA 


TGATAGCGTT 


GAATGGGTTA 


GCATGGTCAC 


CAGCAAGTAA 


8760


TTGCACTTCA 


AGCAAGTTGA 


AGATGTGGTG 


CACACCTGAC 


ACGACGATCA 


ATTGCTGAAC 


8820 


CCCACCAATC 


AAGAAACCAC 


CAAGACCAAA 


TGGCATGCTA 


AGAATCGCTT 


TTGTAGCAAT 


8880 


AACGATGTAG 


TTTTCAACAA 


CCTGGAAAAC 


TGCTCCAATG 


ACAAAGACTC 


CAAGGATAGA 


8940 


CATGACCAAA 


AGTGTCACGA 


ATCGTGTTAC 


CAAGACCTCA 


ATGACATCTG 


GAACAACTTG 


9000 


CGGACAGCTT 


TTTCAAATTT 


AGCTCCGACA 


ACCCCGATGA 


TGAAGGCTGG 


AAGAACGGMV 


9060 


CCTTGCAAAC 


CAACAACAGC 


GATGAAACCA 


AAGAACTTCA 


TCGCTGTTAC 


TTCACCACCT 


9120 


TGAGCAACTG 


CCCAAGCGTT 


TGGAAGTGAG 


CCAGAGACAA 


GCATCATACC 


AAGAACGATA 


9180 


CCAACGGCAG 


GATTTCCACC 


AAATACACCG 


AAGGTTGACC 


ACACAACCAA 


ACCTGGCAAG 


9240 


ATGATCAAGG 


CTGTATCTGT 


CAAGATTTGT 


GTGTAAGT*TG 


CAAAGTCACC 


TGGAACTGGC 


9300 
ATTTCAACAG


CGTTGAAAAG 


ACCACGCACA 


CCCATGAAGA 


GACCTGTCCC 


TACGATAACT 


9360 


GGCATGATTG 


GAACGAAAAC 


ATCACCAAAA 


GTACCGATAC 


CACGTTGGAA 


CCAGTTCCCT 


9420 


TGTTTACCAA 


CTTCTGCTTT 


CATCTCATCC 


TTAGATCATG 


TTGGTAATCC 


AAGTACAACA 


9480 


ACTTCATCGT 


ACATTTTCTT 


AACTCTACCT 


GTACCAAACA 


TAATTTGCTA 


TTGCCCTGAG 


9540 


TTAAAGAAAG 


CACCTTGAAC 


TTTTTCCAAC 


TTCTCAATCA 


CTTCTTTATT GATTTTCTCT 


9600 


TCATCTTTGA 


CCATGACACC 


TAGACGAGTC 


GCACAGTGGG 


CAACACTATT 


GACATTTTCA 


9660 


CGTCCGCCCA 


AGGCATCGAT 


GACTTTTTTT 


GCAATTTCCT 


CATTCTTCAT 


TTGCAAAAAT 


9720 


CTCCTTATAT 


AACATTTTGT 


TCTTGTTTGA 


AACCGATTTT 


ATTCGCCGG 




9769 



(2) INFORMATION FOR SEQ ID NO: 31:

(i) SEQUENCE CHARACTERISTICS:

(A) LENGTH: 3149 base pairs

(B) TYPE: nucleic acid

(C) STRANDEDNESS: double

(D) TOPOLOGY: linear

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 31:



CGCTTGAGTG 


CTAATTCATA 


CTTCTATTGT 


ATCACTTGGT 


CACAAATAAT 


CAAGAAAAAA 


60 


CTCTGACTTT 


CTCAACATAA 


AAAGCCTCAC 


ACCAACTCAG 


ACTTTTTAAT 


TCTTAAAATG 


120 


GCAATTCTTC 


CTCTTCCAAG 


ACCAAATCTG 


CCAAATCTTC 


CCCTGCATTA 


TTTTCACGCA 


180 


TAGCACGTTG 


GGCACGACTT 


TCCAAGAGTT 


GGAATCCTGT 


GACAAGTACT 


TCGGTCACCT 


240 


ACTTCATTTG 


GCCATTTTTC


TCAAAGCCAC 


CGCTACGCAA 


TTCTCCATCA 


ACCGAAATGA 


300 


GACTACCTTT 


GCTTGCCTAC 


TTGCCAAACT 


TTCTCCTAGT 


CTGCCCCATA 


GGACCATATT 


360 


GACAAAATCA 


GCTTCACGTT 


CACCGTTTTG 


CTCTTTGTAA 


CGACGGTTCA 


CAGCGATAGT 


420 


TGCTCGCCCT 


ACCCACTTCT 


CATTCTTCCT 


TTTGTCCAAT 


TCTGCTCTAG 


ACCTTAAACG 


480


TCCAATCAAC 


ATAACTTTAT 


TATACATATT 




TACTTATCTA 


TTCGTAGG/J^ 


540


ATCAAAAAAA 


GTTACAGAAA 


TTTGTAACTT 


TTCCAGAAAA 


TTTTTTATTT 


TTTATCAACC 


600 


ATGAAACCTG 


TCGCCTGTTG 


ATTGGCCATA 


ATGGTCATAT 


CTGTAATCTC 


AACACGACGA 


660 


GGTTGACTAG 


TCACATAGAC 


TACTCTATCT 


GCAATATCCT 


GAGCTTCCAA 


AGCTTCTATT 


720 


CCTTGGTAAA 


CGGACGCAGC 


TCGTTCTTTA 


TCACCATGAA 


AACGCACTGT 


AGAAAAATCT 


780 


GTTTCGACAA 


TTCCAGGCTG 


AATGCTCGTC 


ACCTTCATAT 


CCGTTGCGAT 


GGTATCAATT 


840 


CGCAGTCCAT 


CTGAAAAGGT 


CTTAACTGCC 


GCCTTGGTCG 


CTGAGTAAAC 


AGCTGCACCA 


900 


GCATAGGCAT 


AAATTCCTGC 


CGTTGACCCC 


ATATTCATAA 


TATGACCTTC 


ATTGCCTTTT 


960

What Is Claimed Is:

1. Computer readable medium having recorded thereon the nucleotide
sequence depicted in SEQ ID NOS: 1-391, a representative fragment thereof or a
nucleotide sequence at least 95% identical to a nucleotide sequence depicted in SEQ
ID NOS: 1-391.

2. Computer readable medium having recorded thereon any one of the 
fragments of SEQ ID NOS: 1-391 depicted in Tables 2 and 3 or a degenerate variant 
thereof. 

3. The computer readable medium of claim 1, wherein said medium is 
selected from the group consisting of a floppy disc, a hard disc, random access 
memory (RAM), read only memory (ROM), and CD-ROM. 

4. The computer readable medium of claim 3, wherein said medium is 
selected from the group consisting of a floppy disc, a hard disc, random access 
memory (RAM), read only memory (ROM), and CD-ROM. 

5. A computer-based system for identifying fragments of the Streptococcus 
pneumoniae genome of commercial importance comprising the following elements: 

a) a data storage means comprising the nucleotide sequence of SEQ ID
NOS: 1-391, a representative fragment thereof, or a nucleotide sequence at least
95% identical to a nucleotide sequence of SEQ ID NOS: 1-391;

b) search means for comparing a target sequence to the nucleotide sequence 
of the data storage means of step (a) to identify homologous sequence(s), and 

c) retrieval means for obtaining said homologous sequence(s) of step (b). 

6. A method for identifying commercially important nucleic acid fragments
of the Streptococcus pneumoniae genome comprising the step of comparing a
database comprising the nucleotide sequences depicted in SEQ ID NOS: 1-391, a
representative fragment thereof, or a nucleotide sequence at least 95% identical to a
nucleotide sequence of SEQ ID NOS: 1-391 with a target sequence to obtain a
nucleic acid molecule comprised of a complementary nucleotide sequence to said
target sequence, wherein said target sequence is not randomly selected.

7. A method for identifying an expression modulating fragment of
Streptococcus pneumoniae genome comprising the step of comparing a database
comprising the nucleotide sequences depicted in SEQ ID NOS: 1-391, a
representative fragment thereof, or a nucleotide sequence at least 95% identical to
the nucleotide sequence of SEQ ID NOS: 1-391 with a target sequence to obtain a
nucleic acid molecule comprised of a complementary nucleotide sequence to said
target sequence, wherein said target sequence comprises sequences known to
regulate gene expression.

8. An isolated protein-encoding nucleic acid fragment of the Streptococcus
pneumoniae genome, wherein said fragment consists of the nucleotide sequence of
any one of the fragments of SEQ ID NOS: 1-391 depicted in Tables 2 and 3, or a
degenerate variant thereof.

9. A vector comprising any one of the fragments of the Streptococcus
pneumoniae genome SEQ ID NOS: 1-391 depicted in Tables 2 and 3 or a
degenerate variant thereof.

10. An isolated fragment of the Streptococcus pneumoniae genome,
wherein said fragment modulates the expression of an operably linked open reading
frame, wherein said fragment consists of the nucleotide sequence from about 10 to
200 bases in length which is 5' to any one of the open reading frames depicted in
Tables 2 and 3 or a degenerate variant thereof.

11. A vector comprising any one of the fragments of the Streptococcus
pneumoniae genome of claim 8.

12. An organism which has been altered to contain any one of the
fragments of the Streptococcus pneumoniae genome of claim 8.

13. An organism which has been altered to contain any one of the
fragments of the Streptococcus pneumoniae genome of claim 10.

14. A method for regulating the expression of a nucleic acid molecule
comprising the step of covalently attaching to said nucleic acid molecule a nucleic
acid molecule consisting of the nucleotide sequence from about 10 to 100 bases 5'
to any one of the fragments of the Streptococcus pneumoniae genome depicted in
SEQ ID NOS: 1-391 and Tables 2 and 3 or a degenerate variant thereof.

15. An isolated nucleic acid molecule encoding a homolog of any of the
fragments of the Streptococcus pneumoniae genome of SEQ ID NOS: 1-391 and
Tables 2 and 3, wherein said nucleic acid molecule is produced by a process
comprising steps of:

a) screening a genomic DNA library using as a probe a target sequence
defined by any of SEQ ID NOS: 1-391 and Tables 2 and 3, including fragments
thereof;

b) identifying members of said library which contain sequences that
hybridize to said target sequence; and

c) isolating the nucleic acid molecules from said members identified in step
(b).

16. An isolated DNA molecule encoding a homolog of any one of the
fragments of the Streptococcus pneumoniae genome of SEQ ID NOS: 1-391 and
Tables 2 and 3, wherein said nucleic acid molecule is produced by a process
comprising steps of:

a) isolating mRNA, DNA, or cDNA produced from an organism;

b) amplifying nucleic acid molecules whose nucleotide sequence is
homologous to amplification primers derived from said fragment of said
Streptococcus pneumoniae genome to prime said amplification;

c) isolating said amplified sequences produced in step (b).

17. An isolated polypeptide encoded by any of the fragments of the
Streptococcus pneumoniae genome of SEQ ID NOS: 1-391 and depicted in Table 2
and 3 or by a degenerate variant of said fragments.

18. An isolated polynucleotide molecule encoding any one of the
polypeptides of claim 17.

19. An antibody which selectively binds to any one of the polypeptides of
claim 17.

20. A method for producing a polypeptide in a host cell comprising the
steps of:

a) incubating a host containing a heterologous nucleic acid molecule whose
nucleotide sequence consists of any one of the fragments of the Streptococcus
pneumoniae genome of SEQ ID NOS: 1-391 and depicted in Tables 2 and 3, under
conditions where said heterologous nucleic acid molecule is expressed to produce
said protein, and

b) isolating said protein.

1. Claims: 1-7

Computer readable medium having recorded thereon the 
nucleotide sequence depicted in SEQ ID nos. 1-391, a 
representative fragment thereof or a nucleotide sequence at 
least 95% identical to a nucleotide sequence depicted in SEQ 
ID nos. 1-391; a computer-based system for identifying 
fragments of the Streptococcus pneumoniae genome of 
commercial importance comprising: a) a data storage means
comprising said nucleotide sequence(s); b) search means for 
comparing a target sequence to the nucleotide sequence of 
the data storage means of step (a) to identify homologous 
sequence(s), and c) retrieval means for obtaining said 
homologous sequence(s) of step (b); a method for identifying 
commercially important nucleic acid fragments of the
Streptococcus pneumoniae genome comprising the step of 
comparing a database comprising said nucleotide sequence(s) 
with a target sequence to obtain a nucleic acid molecule 
comprised of a complementary nucleotide sequence to said 
target sequence, wherein said target sequence is not 
randomly selected; a method for identifying an expression
modulating fragment of the Streptococcus pneumoniae genome
comprising the step of comparing a database comprising said 
nucleotide sequence(s) with a target sequence to obtain a 
nucleic acid molecule comprised of a complementary 
nucleotide sequence to said target sequence, wherein said
target sequence comprises sequences known to regulate gene 
expression; 



2. Claims: (8-20) partially 

An isolated protein-encoding nucleic acid fragment of the
Streptococcus pneumoniae genome, wherein said fragment
consists of the nucleotide sequence of the fragment of
SEQ ID no. 1 depicted in Tables 2 and 3, or a degenerate
variant thereof; a vector comprising the fragment of the
Streptococcus pneumoniae genome SEQ ID no. 1; an isolated
fragment of the Streptococcus pneumoniae genome, wherein
said fragment modulates the expression of an operably linked
open reading frame, wherein said fragment consists of the
nucleotide sequence from about 10 to 200 bases in length
which is 5' to any one of the open reading frames of SEQ ID
no. 1 depicted in Tables 2 and 3 or a degenerate variant
thereof; a method for regulating the expression of a nucleic
acid molecule comprising the step of covalently attaching to
said nucleic acid molecule a nucleic acid molecule
consisting of the nucleotide sequence from about 10 to 100
bases 5' to any one of the open reading frames of SEQ ID
no. 1 and Tables 2 and 3 or a degenerate variant thereof; an
isolated nucleic acid molecule encoding a homolog of SEQ
ID no. 1; an isolated polypeptide encoded by SEQ ID no. 1 and
depicted in Table 2 and 3; an antibody which selectively
binds to any one of said polypeptides; a method for
producing a polypeptide in a host cell comprising a)
incubating a host containing a heterologous nucleic acid
molecule whose nucleotide sequence consists of SEQ ID no. 1
and depicted in Table 2 and 3, under conditions where said
heterologous nucleic acid molecule is expressed to produce
said protein, and b) isolating said protein;

3-392. Claims: (8-20) partially

Idem as subject 2 but limited to each of the sequences of 
SEQ ID no. 2 to 391; 

For the sake of conciseness, the second subject matter is 
explicitly defined, the other subject matters are defined by 
analogy hereto. 